Technical Report

An exploratory study on Household Finance and Consumption of Italian households

A group project for the Statistical Data Analysis course

By
Mushtari Khan, Laila Arzuman Ara, Yuva Priyanka Manda,
Mahadevan KS, Hekmatullah Himmat

Introduction

The Household Finance and Consumption Survey data, taken from the European survey data of Banca d'Italia, provides a comprehensive view of household balance sheets and related economic and demographic variables. It contains a vast array of variables covering household financial information, demographics, and assets. Each row of the dataset corresponds to one surveyed household, and the variables described in this report allow households to be grouped by demographic or economic characteristics.

The dataset includes information on household weight, age of the reference person, number of household members, type of household, real estate properties’ value, income from various sources such as self-employment and pensions, as well as information on net worth and debt. Additionally, it provides detailed information on the financial and non-financial assets, liabilities, income, and consumption of households in Italy.

The data was collected by sampling households from each region of the country, making it a representative dataset. With a broad range of variables related to household finances, including employment status, income, savings, debt, property, assets, and consumption patterns, this dataset can be a valuable resource for researchers and policymakers analyzing household finances in Italy.

Dataset link: Household Finance and Consumption Survey

Objective

The objective of this exploratory analysis is to gain a better understanding of the financial and non-financial characteristics of households in Italy. The analysis aims to identify trends in household income, wealth, debt, and consumption, and to explore the relationships between these variables and the demographic and economic characteristics of households. By analyzing the dataset, we hope to generate insights into household finance and consumption in Italy.

Workflow

We divided the statistical data analysis pipeline into sprints, as in agile methodology. The tasks carried out in each sprint are listed below.

Sprints

Sprint

Session

Task

Sprint 1

Data

Domain Understanding

About the Dataset, Context, and Variable discussion

Meta Data Discussion, Typology of Variables (str)

Fixing what questions we can answer, Converting questions to statistical problems

What can be inferred

Data Integration, Merging CSV files, Importing required libraries

Sprint 2

Variable Selection

Summarize

EDA

Data Transformation, Missing Data Imputation, Variable Recoding

Structure and summary, After Cleaning

Outlier Detection, Based on boxplots for metric variables, Replacing outliers with mean values

Sprint 3

Cross tabulations, Findings Discussion

Data Visualization, Boxplots, histograms

Correlation, Heatmaps, scatter plots, Correlation Matrix of whole Dataset

Sprint 4

Summarize the results from EDA

CDA

How the data is distributed, Skewness, Distribution graphs for all variables

Selection of tests based on the distribution

Fix hypotheses to be tested, Hypothesis statements

Test Selection, (Non-)parametric tests

Test Results Discussion, How many of the hypotheses were rejected

Sprint 5

Model Selection, What models will fit the data

Feature Importance, PCA

Regression

Clustering, Decision Tree, Confusion Matrix

Summarize the results from CDA
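The outlier step listed under Sprint 2 (boxplot-based detection, then replacement with mean values) can be illustrated with a minimal sketch. It assumes the usual boxplot rule, under which a value is an outlier if it lies more than 1.5 times the interquartile range beyond the first or third quartile, and it assumes the replacement value is the mean of the remaining non-outlying values. The report does not state either detail, and the function name and sample numbers are hypothetical.

```python
import statistics


def replace_outliers_with_mean(values, k=1.5):
    """Replace boxplot outliers with the mean of the non-outlying values.

    Assumption: a value is an outlier when it falls outside
    [Q1 - k*IQR, Q3 + k*IQR], the rule boxplot whiskers commonly use.
    """
    # Quartiles of the data (inclusive method treats the data as the full range).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    # Mean computed only from values inside the whiskers (an assumption).
    inliers = [v for v in values if low <= v <= high]
    replacement = statistics.mean(inliers)

    return [v if low <= v <= high else replacement for v in values]


# Hypothetical metric variable with one extreme value.
sample = [10, 12, 11, 13, 12, 100]
cleaned = replace_outliers_with_mean(sample)
print(cleaned)
```

Replacing outliers with a mean keeps the sample size unchanged but shrinks the variance of the variable, which may matter for the tests chosen later in the workflow.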

The dataset comprises several CSV files containing core and non-core variables. Among them, we focused on two files, D1.csv (127 columns) and H1.csv (920 columns), which together have over 1,000 columns. However, because many of the columns in the D1 file are derived from the H1 file, we narrowed our focus to approximately 20 columns in the H1 file that hold data about expenditure. To determine which variables to select and which statistical questions to ask, we brainstormed various possibilities based on the available columns.
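As a rough illustration of the merge and column-selection step, the sketch below joins two small in-memory stand-ins for D1.csv and H1.csv on the household identifier and keeps only a chosen subset of columns. It assumes that the ID column is the shared key, since both files start with it. The helper names and the choice of columns are hypothetical; the few values are copied from the sample rows shown in this report.

```python
import csv
import io

# Tiny stand-ins for D1.csv and H1.csv (only a few columns and rows).
D1_SAMPLE = """ID,DH0001,DHHTYPE
IT100000173001,1,52
IT100000375001,1,52
IT100001367001,2,7
"""

H1_SAMPLE = """ID,HB0100
IT100000173001,160
IT100000375001,120
IT100001367001,90
"""


def read_rows(text, keep):
    """Index rows by household ID, keeping only the requested columns."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["ID"]: {col: row[col] for col in keep} for row in reader}


def merge_on_id(d1_text, h1_text, d1_cols, h1_cols):
    """Inner join: keep households that appear in both files."""
    d1 = read_rows(d1_text, d1_cols)
    h1 = read_rows(h1_text, h1_cols)
    return [{"ID": hid, **d1[hid], **h1[hid]} for hid in d1 if hid in h1]


merged = merge_on_id(D1_SAMPLE, H1_SAMPLE, ["DH0001", "DHHTYPE"], ["HB0100"])
for row in merged:
    print(row)
```

With the real files, the text of each CSV would be read from disk instead of from the sample strings, and the column lists would hold the variables selected for the study.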

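The raw sample data from files D1 and H1 is shown below: first the D1 column names followed by six sample households, then the H1 column names followed by the same six households. Each household's row lists fewer values than there are column names, so values cannot be matched to columns by position alone.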

ID

survey

SA0010

SA0100

IM0100

HW0010

DWHOHO

DHAGEH1B

DH0001

DH0006

DH0004

DHHTYPE

DH0002

DA1110

DA1120

DA1121

DA1130

DA1131

DA1140

DA2101

DA2102

DA2103

DA2104

DA2105

DA2106

DA2107

DA2108

DA2109

DL1110

DL1120

DL1200

DL1100

DL2100

DL2110

DL2200

DL2000

DI1412

DI1100

DI1200

DI1300

DI1400

DI1500

DI1600

DI1700

DA1000

DA1200

DA1400

DA2100

DA3001

DL1000

DI2000

DN3001

DA1000i

DA2100i

DA1110i

DA1120i

DA1121i

DA1130i

DA1131i

DA1140i

DA1400i

DA1200i

DA2101i

DA2102i

DA2103i

DA2104i

DA2105i

DA2106i

DA2107i

DA2108i

DA2109i

DL1000i

DL1100i

DL1110i

DL1120i

DL1200i

DODARATIO

DODIRATIO

DODSTOTAL

DODSTOTAL40P

DODSMORTG

DHAQ01

DHNQ01

DHIQ01

DI1300i

DA1122

DA1122i

DATOP10

DHHST

DI1100i

DI1200i

DI1400i

DI1500i

DI1600i

DI1700i

DI1800

DI1800i

DITOP10

DL1231

DL1231i

DODIRATIOM

DATOP10EA

DITOP10EA

DL1210

DL1210i

DL1220

DL1220i

DL1230

DL1230i

DL2000i

DL2100i

DL2110i

DL2120

DL2120i

DL2200i

DL2210

DL2210i

DNTOP10EA

DODSTOTALp

DODSTOTAL40Pp

DL1232

DL1232i

DHAGEH1

DHEDUH1

DHEMPH1

DHGENDERH1

DHIDH1

IT100000173001

1

173

IT

1

898.7334

179.7467

65

1

1

0

52

1.0

150,000

50,000

10,000

1,000

20,000.000

172.053703

31,931.947

211,000

200,000

20,000.000

231,000.00

32,104.001

231,000.00

1

1

1

1

0

1

1

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

4

4

4

0

50,000

1

60

1

0

0

1

1

0

0

0

60

0

60

50

0

0

0

0

0

0

0

0

0

70

0

66

3

4

1

1

IT100000375001

1

375

IT

1

3,652.3074

730.4615

85

1

1

0

52

1.0

150,000

500

0.000

9,344.991

150,500

150,000

0.000

150,500.00

9,344.991

150,500.00

1

1

1

0

0

0

1

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

3

3

1

0

0

40

1

0

0

0

1

0

0

0

10

0

50

5

0

0

0

0

0

0

0

0

0

50

0

85

1

5

2

1

IT100000633001

1

633

IT

1

958.0087

191.6017

80

1

1

0

52

1.0

130,000

1,000

2,000

600.000

5.161611

15,652.588

133,000

130,000

600.000

133,600.00

15,657.750

133,600.00

1

1

1

0

0

1

1

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

3

3

2

0

0

40

1

0

0

1

1

0

0

0

20

0

40

20

0

0

0

0

0

0

0

0

0

50

0

80

1

4

2

1

IT100000923001

1

923

IT

1

682.1561

136.4312

80

1

1

0

52

1.0

280,000

590,000

200

500.000

4.301343

7,150.000

3,000

870,200

870,000

500.000

870,700.00

10,154.301

870,700.00

1

1

1

1

0

0

1

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

5

5

1

0

590,000

1

95

1

0

0

1

1

0

1

0

10

0

95

10

0

0

0

0

0

0

0

0

0

95

0

82

1

5

2

1

IT100001367001

1

1,367

IT

1

890.2372

178.0474

85

2

2

0

7

1.5

60,000

1,000

13,000.000

111.834907

12,053.160

61,000

60,000

13,000.000

74,000.00

12,164.995

74,000.00

1

1

1

0

0

0

1

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

2

2

1

0

0

30

1

0

0

1

1

0

0

0

10

0

30

10

0

0

0

0

0

0

0

0

0

40

0

85

1

4

1

1

IT100001763001

1

1,763

IT

1

5,538.9744

1,107.7949

65

2

2

0

7

1.5

25,000

1,000

8,623.911

74.188787

8,060.000

26,000

25,000

8,623.911

34,623.91

8,134.189

34,623.91

1

1

1

0

0

1

0

0

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

2

2

1

0

0

20

1

0

0

1

1

0

0

0

5

0

30

5

0

0

0

0

0

0

0

0

0

30

0

67

1

4

1

1

ID

survey

SA0010

SA0100

IM0100

HW0010

HB0100

fHB0100

hb0100_B

fhb0100_b

HB0200

fHB0200

HB0300

fHB0300

HB0400

fHB0400

HB0410

fHB0410

HB0500

fHB0500

HB0600

fHB0600

HB0700

fHB0700

HB0800

fHB0800

HB0900

fHB0900

HB1000

fHB1000

HB1010

fHB1010

HB1101

fHB1101

HB1102

fHB1102

HB1103

fHB1103

HB1131a

fHB1131a

HB1131b

fHB1131b

HB1131c

fHB1131c

HB1132a

fHB1132a

HB1132b

fHB1132b

HB1132c

fHB1132c

HB1133a

fHB1133a

HB1133b

fHB1133b

HB1133c

fHB1133c

HB1151

fHB1151

HB1152

fHB1152

HB1153

fHB1153

HB1201a

fHB1201a

HB1201b

fHB1201b

HB1201c

fHB1201c

HB1201d

fHB1201d

HB1201e

fHB1201e

HB1201f

fHB1201f

HB1201g

fHB1201g

HB1201h

fHB1201h

HB1201i

fHB1201i

HB1202a

fHB1202a

HB1202b

fHB1202b

HB1202c

fHB1202c

HB1202d

fHB1202d

HB1202e

fHB1202e

HB1202f

fHB1202f

HB1202g

fHB1202g

HB1202h

fHB1202h

HB1202i

fHB1202i

HB1203a

fHB1203a

HB1203b

fHB1203b

HB1203c

fHB1203c

HB1203d

fHB1203d

HB1203e

fHB1203e

HB1203f

fHB1203f

HB1203g

fHB1203g

HB1203h

fHB1203h

HB1203i

fHB1203i

HB1301

fHB1301

HB1302

fHB1302

HB1303

fHB1303

HB1401

fHB1401

HB1402

fHB1402

HB1403

fHB1403

HB1501

fHB1501

HB1502

fHB1502

HB1503

fHB1503

HB1601

fHB1601

HB1602

fHB1602

HB1603

fHB1603

HB1701

fHB1701

HB1702

fHB1702

HB1703

fHB1703

HB1801

fHB1801

HB1802

fHB1802

HB1803

fHB1803

HB1901

fHB1901

HB1902

fHB1902

HB1903

fHB1903

HB2001

fHB2001

HB2002

fHB2002

HB2003

fHB2003

HB2100

fHB2100

HB2200

fHB2200

HB2300

fHB2300

HB2400

fHB2400

HB2410

fHB2410

HB2501

fHB2501

HB2502

fHB2502

HB2503

fHB2503

HB2601a

fHB2601a

HB2601b

fHB2601b

HB2601c

fHB2601c

HB2601d

fHB2601d

HB2601e

fHB2601e

HB2601f

fHB2601f

HB2602a

fHB2602a

HB2602b

fHB2602b

HB2602c

fHB2602c

HB2602d

fHB2602d

HB2602e

fHB2602e

HB2602f

fHB2602f

HB2603a

fHB2603a

HB2603b

fHB2603b

HB2603c

fHB2603c

HB2603d

fHB2603d

HB2603e

fHB2603e

HB2603f

fHB2603f

HB2701

fHB2701

HB2702

fHB2702

HB2703

fHB2703

HB2801

fHB2801

HB2802

fHB2802

HB2803

fHB2803

HB2900

fHB2900

HB3000

fHB3000

HB3010

fHB3010

HB3101

fHB3101

HB3102

fHB3102

HB3103

fHB3103

HB3131a

fHB3131a

HB3131b

fHB3131b

HB3131c

fHB3131c

HB3132a

fHB3132a

HB3132b

fHB3132b

HB3132c

fHB3132c

HB3133a

fHB3133a

HB3133b

fHB3133b

HB3133c

fHB3133c

HB3151

fHB3151

HB3152

fHB3152

HB3153

fHB3153

HB3201a

fHB3201a

HB3201b

fHB3201b

HB3201c

fHB3201c

HB3201d

fHB3201d

HB3201e

fHB3201e

HB3201f

fHB3201f

HB3201g

fHB3201g

HB3201h

fHB3201h

HB3201i

fHB3201i

HB3202a

fHB3202a

HB3202b

fHB3202b

HB3202c

fHB3202c

HB3202d

fHB3202d

HB3202e

fHB3202e

HB3202f

fHB3202f

HB3202g

fHB3202g

HB3202h

fHB3202h

HB3202i

fHB3202i

HB3203a

fHB3203a

HB3203b

fHB3203b

HB3203c

fHB3203c

HB3203d

fHB3203d

HB3203e

fHB3203e

HB3203f

fHB3203f

HB3203g

fHB3203g

HB3203h

fHB3203h

HB3203i

fHB3203i

HB3301

fHB3301

HB3302

fHB3302

HB3303

fHB3303

HB3401

fHB3401

HB3402

fHB3402

HB3403

fHB3403

HB3501

fHB3501

HB3502

fHB3502

HB3503

fHB3503

HB3601

fHB3601

HB3602

fHB3602

HB3603

fHB3603

HB3701

fHB3701

HB3702

fHB3702

HB3703

fHB3703

HB3801

fHB3801

HB3802

fHB3802

HB3803

fHB3803

HB3901

fHB3901

HB3902

fHB3902

HB3903

fHB3903

HB4001

fHB4001

HB4002

fHB4002

HB4003

fHB4003

HB4100

fHB4100

HB4200

fHB4200

HB4300

fHB4300

HB4310

fHB4310

HB4400

fHB4400

HB4500

fHB4500

HB4510a

fHB4510a

HB4510b

fHB4510b

HB4510c

fHB4510c

HB4510d

fHB4510d

HB4510e

fHB4510e

HB4510f

fHB4510f

HB4600

fHB4600

HB4700

fHB4700

HB4710

fHB4710

HB4800

fHB4800

HB4810

fHB4810

HC0100

fHC0100

HC0110

fHC0110

HC0200

fHC0200

HC0210

fHC0210

HC0220

fHC0220

HC0300

fHC0300

HC0310

fHC0310

HC0320

fHC0320

HC0330

fHC0330

HC0340

fHC0340

HC0351a

fHC0351a

HC0351b

fHC0351b

HC0351c

fHC0351c

HC0351d

fHC0351d

HC0351e

fHC0351e

HC0351f

fHC0351f

HC0351g

fHC0351g

HC0351h

fHC0351h

HC0351i

fHC0351i

HC0352a

fHC0352a

HC0352b

fHC0352b

HC0352c

fHC0352c

HC0352d

fHC0352d

HC0352e

fHC0352e

HC0352f

fHC0352f

HC0352g

fHC0352g

HC0352h

fHC0352h

HC0352i

fHC0352i

HC0353a

fHC0353a

HC0353b

fHC0353b

HC0353c

fHC0353c

HC0353d

fHC0353d

HC0353e

fHC0353e

HC0353f

fHC0353f

HC0353g

fHC0353g

HC0353h

fHC0353h

HC0353i

fHC0353i

HC0361

fHC0361

HC0362

fHC0362

HC0363

fHC0363

HC0370

fHC0370

HC0400

fHC0400

HC0410

fHC0410

HC0501a

fHC0501a

HC0501b

fHC0501b

HC0501c

fHC0501c

HC0501d

fHC0501d

HC0501e

fHC0501e

HC0501f

fHC0501f

HC0501g

fHC0501g

HC0501h

fHC0501h

HC0501i

fHC0501i

HC0502a

fHC0502a

HC0502b

fHC0502b

HC0502c

fHC0502c

HC0502d

fHC0502d

HC0502e

fHC0502e

HC0502f

fHC0502f

HC0502g

fHC0502g

HC0502h

fHC0502h

HC0502i

fHC0502i

HC0503a

fHC0503a

HC0503b

fHC0503b

HC0503c

fHC0503c

HC0503d

fHC0503d

HC0503e

fHC0503e

HC0503f

fHC0503f

HC0503g

fHC0503g

HC0503h

fHC0503h

HC0503i

fHC0503i

HC0601

fHC0601

HC0602

fHC0602

HC0603

fHC0603

HC0701

fHC0701

HC0702

fHC0702

HC0703

fHC0703

HC0801

fHC0801

HC0802

fHC0802

HC0803

fHC0803

HC0901

fHC0901

HC0902

fHC0902

HC0903

fHC0903

HC1001

fHC1001

HC1002

fHC1002

HC1003

fHC1003

HC1100

fHC1100

HC1200

fHC1200

HC1300

fHC1300

HC1310a

fHC1310a

HC1310b

fHC1310b

HC1320

fHC1320

HC1400

fHC1400

HD0100

fHD0100

HD0200

fHD0200

HD0210

fHD0210

HD0301

fHD0301

HD0302

fHD0302

HD0303

fHD0303

HD0401

fHD0401

HD0402

fHD0402

HD0403

fHD0403

HD0501

fHD0501

hd0501_B

fhd0501_b

HD0502

fHD0502

hd0502_B

fhd0502_b

HD0503

fHD0503

hd0503_B

fhd0503_b

HD0601a

fHD0601a

HD0601b

fHD0601b

HD0601c

fHD0601c

HD0601d

fHD0601d

HD0601e

fHD0601e

HD0601f

fHD0601f

HD0602a

fHD0602a

HD0602b

fHD0602b

HD0602c

fHD0602c

HD0602d

fHD0602d

HD0602e

fHD0602e

HD0602f

fHD0602f

HD0603a

fHD0603a

HD0603b

fHD0603b

HD0603c

fHD0603c

HD0603d

fHD0603d

HD0603e

fHD0603e

HD0603f

fHD0603f

HD0701

fHD0701

HD0702

fHD0702

HD0703

fHD0703

HD0801

fHD0801

HD0802

fHD0802

HD0803

fHD0803

HD0900

fHD0900

HD1000

fHD1000

HD1010

fHD1010

HD1100

fHD1100

HD1110

fHD1110

HD1200

fHD1200

HD1210

fHD1210

HD1300

fHD1300

HD1310a

fHD1310a

HD1310b

fHD1310b

HD1310c

fHD1310c

HD1310d

fHD1310d

HD1310e

fHD1310e

HD1310f

fHD1310f

HD1310g

fHD1310g

HD1320a

fHD1320a

HD1320b

fHD1320b

HD1320c

fHD1320c

HD1320d

fHD1320d

HD1320e

fHD1320e

HD1320f

fHD1320f

HD1320g

fHD1320g

HD1330

fHD1330

HD1400

fHD1400

HD1410a

fHD1410a

HD1410b

fHD1410b

HD1410c

fHD1410c

HD1410d

fHD1410d

HD1420

fHD1420

HD1500

fHD1500

HD1510

fHD1510

HD1520

fHD1520

HD1600

fHD1600

HD1610

fHD1610

HD1620

fHD1620

HD1700

fHD1700

HD1710

fHD1710

HD1800

fHD1800

HD1900

fHD1900

HD1920

fHD1920

HG0100

fHG0100

HG0110

fHG0110

HG0200

fHG0200

HG0210

fHG0210

HG0300

fHG0300

HG0310

fHG0310

HG0400

fHG0400

HG0410

fHG0410

HG0500

fHG0500

HG0510

fHG0510

HG0600

fHG0600

HG0610

fHG0610

HG0700

fHG0700

HG0800

fHG0800

HH0100

fHH0100

HH0110

fHH0110

HH0201

fHH0201

HH0202

fHH0202

HH0203

fHH0203

HH0301a

fHH0301a

HH0301b

fHH0301b

HH0301c

fHH0301c

HH0301d

fHH0301d

HH0301e

fHH0301e

HH0301f

fHH0301f

HH0301g

fHH0301g

HH0301h

fHH0301h

HH0301i

fHH0301i

HH0302a

fHH0302a

HH0302b

fHH0302b

HH0302c

fHH0302c

HH0302d

fHH0302d

HH0302e

fHH0302e

HH0302f

fHH0302f

HH0302g

fHH0302g

HH0302h

fHH0302h

HH0302i

fHH0302i

HH0303a

fHH0303a

HH0303b

fHH0303b

HH0303c

fHH0303c

HH0303d

fHH0303d

HH0303e

fHH0303e

HH0303f

fHH0303f

HH0303g

fHH0303g

HH0303h

fHH0303h

HH0303i

fHH0303i

HH0401

fHH0401

HH0402

fHH0402

HH0403

fHH0403

HH0501

fHH0501

HH0502

fHH0502

HH0503

fHH0503

HH0601

fHH0601

HH0602

fHH0602

HH0603

fHH0603

HH0700

fHH0700

HI0100

fHI0100

HI0200

fHI0200

HI0210

fHI0210

HI0220

fHI0220

HI0300

fHI0300

HI0310

fHI0310

HI0400a

fHI0400a

HI0400b

fHI0400b

HI0400c

fHI0400c

HI0400d

fHI0400d

HI0400e

fHI0400e

HI0400f

fHI0400f

HI0400g

fHI0400g

HI0400h

fHI0400h

HI0400i

fHI0400i

HI0400j

fHI0400j

HI0400k

fHI0400k

HI0400l

fHI0400l

HI0500

fHI0500

HI0600

fHI0600

HI0700a

fHI0700a

HI0700b

fHI0700b

HI0700c

fHI0700c

HI0700d

fHI0700d

HI0700e

fHI0700e

HI0700f

fHI0700f

HI0700g

fHI0700g

HI0800

fHI0800

SA0110

fSA0110

SA0200

fSA0200

SA0210

fSA0210

sb1000

fsb1000

SC0100

fSC0100

IT100000173001

1

173

IT

1

898.7334

160

1

9

1

10

1

1

1

0

0

0

1

1

2,004

1

100,000

1

150,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

0

0

1

1

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

100

1

0

0

50,000

1

0

0

0

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

2,051

10,000

1

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

1

1

1,000

1

2

1

0

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

20,000

1

2

1

0

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

2

1

2

1

0

2

1

0

2

1

0

2

1

0

1

1

172.053703

5,050

2,051

2,051

2

1

0

2

1

2

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

300

1

200

1

250.00000

1

1,000

1

1

1

41.66667

1

2

1

2

1

2

1

2,051

1

1

2

1

1

1

2

1

1

1

2

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

1

1

173

1

2,014

1

2,010

1

2,015

1

3

1

IT100000375001

1

375

IT

1

3,652.3074

120

1

8

1

44

1

1

1

0

0

0

1

1

1,970

1

5,000

1

150,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

2,051

0

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

1

1

500

1

2

1

0

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

0

4,053

2

1

0

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

4

1

2

1

0

2

1

0

2

1

0

2

1

0

2

1

0

2,051

2,051

2

1

0

3

1

1

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

300

1

0

1

16.66667

1

790

1

2

1

0

2

1

2

1

2

1

2,051

1

1

2

1

2

1

2

1

2

1

2

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

2

1

375

1

2,014

1

2,010

1

2,015

1

3

1

IT100000633001

1

633

IT

1

958.0087

100

1

7

1

49

1

1

1

0

0

0

3

1

1,965

1

1,051

130,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

2,051

1,000

1

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

1

1

2,000

1

1

1

1,000

1

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

1

1

600.000

1

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

3

1

2

1

0

2

1

0

2

1

0

2

1

0

1

1

5.161611

5,050

2,051

2,051

2

1

0

2

1

2

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

400

1

30

1

150.00000

1

1,000

1

2

1

0

2

1

2

1

2

1

2,051

2

1

2

1

1

1

2

1

2

1

2

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

1

1

633

1

2,014

1

2,010

1

2,015

1

2

1

IT100000923001

1

923

IT

1

682.1561

180

1

9

1

46

1

1

1

0

0

0

2

1

1,968

1

6,000

1

280,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

2

1

1

1

3

1

0

5

1

2,051

2,051

2,051

2,051

2,051

5

1

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

100

1

100

1

0

240,000

1

350,000

1

0

0

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

2,051

0

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

1

1

200

1

2

1

0

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

500

4,053

2

1

0

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

4

1

2

1

0

2

1

0

1

1

3,000

5,050

2

1

0

1

1

4.301343

5,050

2,051

2,051

2

1

0

2

1

2

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

350

1

0

1

91.66667

1

800

1

2

1

0

2

1

2

1

2

1

2,051

1

1

2

1

2

1

2

1

2

1

2

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

2

1

923

1

2,014

1

2,010

1

2,015

1

2

1

IT100001367001

1

1,367

IT

1

890.2372

90

1

6

1

49

1

1

1

0

0

0

1

1

1,965

1

5,000

1

60,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

2,051

0

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

1

1

1,000

1

2

1

0

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

1

1

13,000.000

1

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

2

1

2

1

0

2

1

0

2

1

0

2

1

0

1

1

111.834907

5,050

2,051

2,051

2

1

0

2

1

1

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

250

1

50

1

83.33333

1

750

1

2

1

0

2

1

2

1

2

1

2,051

2

1

2

1

1

1

2

1

2

1

2

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

2

1

1,367

1

2,014

1

2,010

1

2,015

1

2

1

IT100001763001

1

1,763

IT

1

5,538.9744

45

1

3

1

37

1

1

1

0

0

0

3

1

1,977

1

1,051

25,000

1

2

1

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

2,051

1,000

1

2

1

2,051

2,051

2,051

2,051

2,051

2,051

0

2

1

0

2

1

0

2

1

0

2

1

0

0

2

1

0

0

2

1

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

2,051

2,051

2,051

2,051

2,051

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

2,051

2,051

2,051

2

1

0

0

2,051

2,051

2,051

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

1

1

8,623.911

4,053

2

1

0

0

0

2,051

2,051

0

2,051

0

0

0

2,051

2,051

0

2,051

2,051

2

1

0

0

0

0

0

2

1

0

0

2

1

0

0

2

1

0

4

1

2

1

0

2

1

0

2

1

0

2

1

0

1

1

74.188787

5,050

2,051

2,051

2

1

0

2

1

2

1

2

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

300

1

0

1

29.16667

1

800

1

2

1

0

2

1

2

1

2

1

2,051

2

1

2

1

2

1

2

1

1

1

1

1

2,051

2

1

3

1

3

1

0

0

0

0

0

0

0

2

1

1,763

1

2,014

1

2,010

1

2,015

1

1

1

The following table lists the variables considered for this study, with a short description of each derived from the metadata of the dataset.

Variable

Description

Number_of_Household_Members

Number of household members, all household members included

Number_of_Household_Members_in_Employment

Number of persons for which PE0100a (Labour status, main) = 1 ( 'doing regular work for pay/ self-employed/ family business') or 2 ('sick, maternity/other leave, planning to return to work').

Household_Type

Household composition: 51 - One adult, younger than 65 years; 52 - One adult, 65 years and over; 6 - Two adults younger than 65 years; 7 - Two adults, at least one aged 65 years and over; 8 - Three or more adults; 9 - Single parent with dependent children; 10 - Two adults with one dependent child; 11 - Two adults with two dependent children; 12 - Two adults with three or more dependent children; 13 - Three or more adults with dependent children

Value_of_Household_VehicleS

Represents the value of household vehicles

Valuables

Value of other valuables

Deposits

Value of Deposits

Mutual_Funds

Value of Mutual Funds

Bonds

Value of Bonds

Employee_Income

Total employee income of the household

Self_Employment_income

Sum of gross self-employment income

Rental_Income

Rental income from real estate property

Has_Rental_Income

Has rental income from real estate property

Financial_assets_Income

Value of income from financial assets

Pension_Income

Income received as Pensions

Total_Real_Assets

Total real assets 1 (incl. business wealth, vehicles and valuables)

Total_Financial_Assets

Total financial assets 1 (excl. public and occupational pension plans)

Has_Real_Assets

Do you have Real Assets?

Has_Financial_Assets

Do you have financial assets?

Has_Vehicles

Do you have vehicles?

Has_Valuables

Do you have any other valuables?

Value_of_Self_employment_Businesses

Value of self-employment businesses

Has_Real_Estate_Wealth

Do you have real estate wealth?

Has_Deposits

Do you have deposits?

Has_Mutual_Funds

Do you have mutual funds?

Has_Bonds

Do you have bonds?

Has_Shares

Do you have shares?

Has_Debt

Do you have debt?

Housing_Status

Household's housing status: 1 - Owner - outright; 2 - Owner - with mortgage; 3 - Renter/Other

Has_Employee_Income

Do you have income through employment?

Has_Self_Employee_Income

Do you have income through self employment?

Has_Financial_assets_Income

Do you have income through financial assets?

Has_Income_From_Pensions

Do you have income through pensions?

Income_From_Other_Sources

Value of income through other sources

Has_Income_From_Other_Sources

Do you have income from other sources?

Credit_Card_Debt

Value of Credit card debt

Has_Credit_Card_Debt

Do you have credit card debt?

Way_Of_Acquring_Property

How (did you/your household) acquire the (part of the) residence (you own/your household owns): did you purchase it, did you construct it yourself, did you inherit it or did you receive it as a gift?

Monthly_Amount_Paid_As_Rent

What is the monthly amount paid as rent (please exclude utilities, heating, etc. if possible)?

Ownership_of_Cars

(Do you/Does anyone in your household) own any cars?

Total_Value_of_Cars

For the cars that you/your household own, if you sold them now, about how much do you think you could get?

Has_Other_Vehicles

(Do you/does anyone in your household) own any other type of vehicle, such as motorbikes, trucks, vans, planes, boats or yachts or any other vehicle such as trailers, caravans, etc.?

Value_Of_Other_Vehicles

If (you/your household) decided to sell (this vehicle/these vehicles) now, how much do you think you would get?

Ownership_Of_Other_Valuables

(Do you/Does your household) own any valuables such as jewellery, works of art, antiques, etc.?

Value_Of_Other_Valuables

In total, approximately how much do you think all these valuables would bring if you sold them?

Household_Has_a_Credit_Card

Do you or any other member of the household have credit cards other than ones paid by employers? (Do not consider here debit cards, i.e. cards where the money spent is immediately deducted from your bank account).

Has_Private_Loans

Do you have loans from relatives or friends that you are expected to repay?

No_of_PrivateLoans

Number of private loans a household has taken

Has_Applied_for_Loan_Credit

In the last three years, have you (or any member of your household) applied for a loan or other credit?

Household_Owns_Saving_accounts

(Do you/does anyone in your household) have any saving accounts, time deposits, certificates of deposit or other such deposits?

Value_of_Saving_Accounts

Value of saving accounts (HD1210); positive account balances are summed up as part of assets

Investment_Attitudes

The amount of financial risk that you (and your husband/wife/partner) are willing to take when you save or make investments: 1 - Take substantial financial risks expecting to earn substantial returns; 2 - Take above average financial risks expecting to earn above average returns; 3 - Take average financial risks expecting to earn average returns; 4 - Not willing to take any financial risk

Amount_spent_on_Food_at_Home

How much does (you/your household) spend in a typical month on food and beverages at home?

Amount_Spent_on_Food_Outside_Home

How much does (you/your household) spend in a typical month on food and beverages outside the home? I mean expenses at restaurants, lunches, canteens, coffee shops and the like. Please, include only the amounts (you/your household) pay out i.e. net of any employer subsidy/discount/promotion etc.

AMount_Spent_on_Utilities

How much does your household spend on utilities (electricity, water, gas, telephone, internet and television) in a typical month?

Amount_Spent_on_Consumer_Goods_Services

How much does a household spend in a typical month on all consumer goods and services? Includes all household expenses including food, utilities, etc. but excluding consumer durables (e.g. cars, household appliances, etc.), rent, loan repayments, insurance policies, renovation, etc

Total_Gross_Income

Total gross annual household income aggregate.

Data Cleaning and Transformation

Variable Recoding

We first selected, from the survey files, the columns that describe the income and expenditure of the households. Because the raw data uses coded variable names and numeric category codes, most columns were renamed and recoded. Each row in the dataset represents one household. No row had a large number of missing values, so cleaning was limited to replacing missing values in numeric columns with 0 and skipping columns that carried little relevant information.

# Identify numeric columns
numeric_cols <- sapply(hcfs, is.numeric)

# Replace NAs with 0 in numeric columns
hcfs[numeric_cols][is.na(hcfs[numeric_cols])] <- 0

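# Rename and recode the reference person's demographics.
# age_groups() matches the signature of AMR::age_groups() (assumed source package).
hcfs <- hcfs %>% 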
  rename(Gender = DHGENDERH1, Age = DHAGEH1, Education_Level = DHEDUH1) %>%
  filter(Gender %in% c(1,2)) %>% 
  mutate(Gender = recode(Gender, `1` = "Male", `2` ="Female")) %>% 
  mutate(Age = age_groups(Age, split_at = c(35, 45, 55, 65, 75), na.rm = FALSE)) %>% 
  mutate(Education_Level = recode(Education_Level, 
                                  `0` = "No formal education",
                                  `1` = "Primary education",
                                  `2` = "Lower secondary",
                                  `3` = "Upper secondary",
                                  `4` = "Post-secondary",
                                  `5` = "First stage tertiary",
                                  `6` = "Second stage tertiary"))

hcfs <- hcfs %>% 
  rename(Employment_status = DHEMPH1) %>% 
  mutate(Employment_status = recode(Employment_status, 
                                  `1` = "Employee",
                                  `2` = "Self-employed",
                                  `3` = "Unemployed",
                                  `4` = "Retired",
                                  `5` = "Other"))

hcfs <- hcfs %>% 
  rename(
      Number_of_Household_Members = DH0001, 
      Number_of_Household_Members_in_Employment = DH0004,   
      Household_Type =  DHHTYPE,        
      Value_of_Household_Vehicles = DA1130,       
      Valuables = DA1131,  
      Deposits = DA2101,
      Mutual_Funds = DA2102,     
      Bonds = DA2103,           
      Employee_Income = DI1100,     
      Self_Employment_income = DI1200,      
      Rental_Income = DI1300,
      Has_Rental_Income = DI1300i,
      Financial_assets_Income = DI1400,    
      Pension_Income =  DI1500,
      Total_Real_Assets =  DA1000,   
      Total_Financial_Assets = DA2100, 
      Has_Real_Assets = DA1000i,    
      Has_Financial_Assets = DA2100i,    
      Has_Vehicles = DA1130i,  
      Has_Valuables = DA1131i,   
      Value_of_Self_employment_Businesses = DA1140i,      
      Has_Real_Estate_Wealth = DA1400i,     
      Has_Deposits = DA2101i,   
      Has_Mutual_Funds = DA2102i, 
      Has_Bonds = DA2103i,    
      Has_Shares = DA2105i,     
      Has_Debt = DL1000i,
      Housing_Status = DHHST,       
      Has_Employee_Income = DI1100i,       
      Has_Self_Employee_Income = DI1200i,   
      Has_Financial_assets_Income =   DI1400i,      
      Has_Income_From_Pensions = DI1500i,   
      Income_From_Other_Sources =   DI1800,     
      Has_Income_From_Other_Sources =  DI1800i,   
      Credit_Card_Debt = DL1220,  
      Has_Credit_Card_Debt =  DL1220i,       
      Way_Of_Acquring_Property = HB0600,         
      Monthly_Amount_Paid_As_Rent = HB2300,     
      Ownership_of_Cars = HB4300,     
      Total_Value_of_Cars = HB4400,   
      Has_Other_Vehicles = HB4500, 
      Value_Of_Other_Vehicles =   HB4600,    
      Ownership_Of_Other_Valuables = HB4700,     
      Value_Of_Other_Valuables = HB4710,                 
      Household_Has_a_Credit_Card = HC0300,    
      Has_Private_Loans = HC0330,          
      No_of_PrivateLoans = HC0340,   
      Has_Applied_for_Loan_Credit = HC1300,      
      Household_Owns_Saving_accounts = HD1200,  
      Value_of_Saving_Accounts = HD1210,      
      Investment_Attitudes = HD1800,            
      Amount_spent_on_Food_at_Home = HI0100,       
      Amount_Spent_on_Food_Outside_Home = HI0200,      
      AMount_Spent_on_Utilities = HI0210,       
      Amount_Spent_on_Consumer_Goods_Services = HI0220,
      Total_Gross_Income=DI2000
  )

# Indicator columns: code 1 becomes "Yes"; every other code
# (including the 0 that replaced NA above) becomes "No".
yes_no_cols <- c(
  "Household_Owns_Saving_accounts", "Ownership_of_Cars",
  "Ownership_Of_Other_Valuables", "Has_Real_Assets",
  "Has_Financial_Assets", "Has_Vehicles", "Has_Valuables",
  "Has_Real_Estate_Wealth", "Has_Deposits", "Has_Mutual_Funds",
  "Has_Bonds", "Has_Shares", "Has_Debt", "Has_Employee_Income",
  "Has_Self_Employee_Income", "Has_Financial_assets_Income",
  "Has_Income_From_Pensions", "Has_Income_From_Other_Sources",
  "Has_Credit_Card_Debt", "Has_Other_Vehicles",
  "Household_Has_a_Credit_Card", "Has_Private_Loans",
  "Has_Applied_for_Loan_Credit", "Has_Rental_Income"
)

hcfs <- hcfs %>%
  mutate(across(all_of(yes_no_cols),
                ~ recode(.x, `1` = "Yes", `2` = "No", .default = "No")))

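# Note: .default maps every other code (including the 0 that replaced NA,
# e.g. for households that do not own their residence) to "Inherited".
hcfs <- hcfs %>%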
  mutate(Way_Of_Acquring_Property = recode(Way_Of_Acquring_Property, 
                                  `1` = "Purchased",
                                  `2` = "Own construction",
                                  `3` = "Inherited",
                                  `4` = "Gift",
                                  .default = "Inherited"))


hcfs <- hcfs %>%
  mutate(Housing_Status = recode(Housing_Status, 
                                  `1` = "Owner",
                                  `2` = "Owner with mortgage",
                                  `3` = "Renter"))


hcfs <- hcfs %>%
  mutate(Investment_Attitudes = recode(Investment_Attitudes, 
                                  `1` = "Take substantial financial risks",
                                  `2` = "Take above average financial risks",
                                  `3` = "Take average financial risks",
                                  `4` = "Not willing to take any financial risk"))


hcfs <- hcfs %>%
  mutate(Household_Type = recode(Household_Type, 
                                  `51` = "One adult, younger than 65 years",
                                  `52` = "One adult, 65 years and over",
                                  `6` = "Two adults younger than 65 years",
                                  `7` = "Two adults, at least one aged 65 years and over",
                                  `8` = "Three or more adults",
                                  `9` = "Single parent with dependent children",
                                  `10` = "Two adults with one dependent child",
                                  `11` = "Two adults with two dependent children",
                                  `12` = "Two adults with three or more dependent children",
                                  `13` = "Three or more adults with dependent children"))

Outlier detection

To detect outliers, boxplots of the main metric variables were examined, and the outlying values were replaced by the mean of the respective variable.
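
The report does not show the code for this step. The following is a minimal sketch of one way to do it, assuming the usual boxplot rule (values more than 1.5 times the interquartile range beyond the hinges, as flagged by `boxplot.stats()`). The helper name `replace_outliers_with_mean()` and the toy vector are hypothetical, and whether the original mean included the outliers is not stated; the sketch uses the mean of the non-outlying values.

```r
# Hypothetical helper: replace boxplot outliers in a numeric vector by the
# mean of the remaining (non-outlying) values. The 1.5 * IQR rule is an
# assumption; the report does not state the threshold that was used.
replace_outliers_with_mean <- function(x, coef = 1.5) {
  out <- boxplot.stats(x, coef = coef)$out   # values flagged as outliers
  is_out <- x %in% out
  x[is_out] <- mean(x[!is_out], na.rm = TRUE)
  x
}

# Toy data (not taken from the survey)
income <- c(12000, 15000, 18000, 21000, 24000, 500000)
cleaned <- replace_outliers_with_mean(income)
print(cleaned)
```

In the toy vector only the value 500000 lies outside the whiskers, so it is the only one replaced.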

Structure and Summaries

The dataset has 8156 rows and 61 columns after cleaning and recoding. The structure and the distribution of values in the cleaned data are as follows:

First six rows of the cleaned data (transposed: one line per variable, one column per household)

| Variable | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| ID | IT100000173001 | IT100000375001 | IT100000633001 | IT100000923001 | IT100001367001 | IT100001763001 |
| Number_of_Household_Members | 1 | 1 | 1 | 1 | 2 | 2 |
| Number_of_Household_Members_in_Employment | 0 | 0 | 0 | 0 | 0 | 0 |
| Household_Type | One adult, 65 years and over | One adult, 65 years and over | One adult, 65 years and over | One adult, 65 years and over | Two adults, at least one aged 65 years and over | Two adults, at least one aged 65 years and over |
| Value_of_Household_Vehicles | 10,000 | 0 | 1,000 | 0 | 0 | 1,000 |
| Valuables | 1,000 | 500 | 2,000 | 200 | 1,000 | 0 |
| Deposits | 20,000.000 | 0.000 | 600.000 | 500.000 | 13,000.000 | 8,623.911 |
| Mutual_Funds | 0 | 0 | 0 | 0 | 0 | 0 |
| Bonds | 0 | 0 | 0 | 0 | 0 | 0 |
| Employee_Income | 0 | 0 | 0 | 0 | 0 | 0 |
| Self_Employment_income | 0 | 0 | 0 | 0 | 0 | 0 |
| Rental_Income | 0 | 0 | 0 | 0 | 0 | 0 |
| Financial_assets_Income | 172.053703 | 0.000000 | 5.161611 | 4.301343 | 111.834907 | 74.188787 |
| Pension_Income | 31,931.947 | 9,344.991 | 15,652.588 | 7,150.000 | 12,053.160 | 8,060.000 |
| Total_Real_Assets | 211,000 | 150,500 | 133,000 | 870,200 | 61,000 | 26,000 |
| Total_Financial_Assets | 20,000.000 | 0.000 | 600.000 | 500.000 | 13,000.000 | 8,623.911 |
| Total_Gross_Income | 32,104.001 | 9,344.991 | 15,657.750 | 10,154.301 | 12,164.995 | 8,134.189 |
| Has_Real_Assets | Yes | Yes | Yes | Yes | Yes | Yes |
| Has_Financial_Assets | Yes | Yes | Yes | Yes | Yes | Yes |
| Has_Vehicles | Yes | No | Yes | No | No | Yes |
| Has_Valuables | Yes | Yes | Yes | Yes | Yes | No |
| Value_of_Self_employment_Businesses | 0 | 0 | 0 | 0 | 0 | 0 |
| Has_Real_Estate_Wealth | Yes | Yes | Yes | Yes | Yes | Yes |
| Has_Deposits | Yes | Yes | Yes | Yes | Yes | Yes |
| Has_Mutual_Funds | No | No | No | No | No | No |
| Has_Bonds | No | No | No | No | No | No |
| Has_Shares | No | No | No | No | No | No |
| Has_Debt | No | No | No | No | No | No |
| Has_Rental_Income | No | No | No | No | No | No |
| Housing_Status | Owner | Owner | Owner | Owner | Owner | Owner |
| Has_Employee_Income | No | No | No | No | No | No |
| Has_Self_Employee_Income | No | No | No | No | No | No |
| Has_Financial_assets_Income | Yes | No | Yes | Yes | Yes | Yes |
| Has_Income_From_Pensions | Yes | Yes | Yes | Yes | Yes | Yes |
| Income_From_Other_Sources | 0 | 0 | 0 | 0 | 0 | 0 |
| Has_Income_From_Other_Sources | No | No | No | No | No | No |
| Credit_Card_Debt | 0 | 0 | 0 | 0 | 0 | 0 |
| Has_Credit_Card_Debt | No | No | No | No | No | No |
| Age | 65-74 | 75+ | 75+ | 75+ | 75+ | 65-74 |
| Education_Level | Upper secondary | Primary education | Primary education | Primary education | Primary education | Primary education |
| Employment_status | Retired | Other | Retired | Other | Retired | Retired |
| Gender | Male | Female | Female | Female | Male | Male |
| Way_Of_Acquring_Property | Purchased | Purchased | Inherited | Own construction | Purchased | Inherited |
| Monthly_Amount_Paid_As_Rent | 0 | 0 | 0 | 0 | 0 | 0 |
| Ownership_of_Cars | Yes | No | Yes | No | No | Yes |
| Total_Value_of_Cars | 10,000 | 0 | 1,000 | 0 | 0 | 1,000 |
| Has_Other_Vehicles | No | No | No | No | No | No |
| Value_Of_Other_Vehicles | 0 | 0 | 0 | 0 | 0 | 0 |
| Ownership_Of_Other_Valuables | Yes | Yes | Yes | Yes | Yes | No |
| Value_Of_Other_Valuables | 1,000 | 500 | 2,000 | 200 | 1,000 | 0 |
| Household_Has_a_Credit_Card | No | No | No | No | No | No |
| Has_Private_Loans | No | No | No | No | No | No |
| No_of_PrivateLoans | 0 | 0 | 0 | 0 | 0 | 0 |
| Has_Applied_for_Loan_Credit | No | No | No | No | No | No |
| Household_Owns_Saving_accounts | No | No | Yes | No | Yes | Yes |
| Value_of_Saving_Accounts | 0.000 | 0.000 | 600.000 | 0.000 | 4,213.578 | 8,623.911 |
| Investment_Attitudes | Take above average financial risks | Not willing to take any financial risk | Take average financial risks | Not willing to take any financial risk | Take above average financial risks | Not willing to take any financial risk |
| Amount_spent_on_Food_at_Home | 300 | 300 | 400 | 350 | 250 | 300 |
| Amount_Spent_on_Food_Outside_Home | 200 | 0 | 30 | 0 | 50 | 0 |
| AMount_Spent_on_Utilities | 250.00000 | 16.66667 | 150.00000 | 91.66667 | 83.33333 | 29.16667 |
| Amount_Spent_on_Consumer_Goods_Services | 1,000 | 790 | 1,000 | 800 | 750 | 800 |

## 'data.frame':    8156 obs. of  61 variables:
##  $ ID                                       : chr  "IT100000173001" "IT100000375001" "IT100000633001" "IT100000923001" ...
##  $ Number_of_Household_Members              : int  1 1 1 1 2 2 1 1 3 1 ...
##  $ Number_of_Household_Members_in_Employment: int  0 0 0 0 0 0 0 0 1 1 ...
##  $ Household_Type                           : chr  "One adult, 65 years and over" "One adult, 65 years and over" "One adult, 65 years and over" "One adult, 65 years and over" ...
##  $ Value_of_Household_Vehicles              : num  10000 0 1000 0 0 1000 0 200 8500 0 ...
##  $ Valuables                                : num  1000 500 2000 200 1000 0 3000 5000 500 200 ...
##  $ Deposits                                 : num  20000 0 600 500 13000 ...
##  $ Mutual_Funds                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bonds                                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Employee_Income                          : num  0 0 0 0 0 ...
##  $ Self_Employment_income                   : num  0 0 0 0 0 ...
##  $ Rental_Income                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Financial_assets_Income                  : num  172.05 0 5.16 4.3 111.83 ...
##  $ Pension_Income                           : num  31932 9345 15653 7150 12053 ...
##  $ Total_Real_Assets                        : num  211000 150500 133000 870200 61000 ...
##  $ Total_Financial_Assets                   : num  20000 0 600 500 13000 ...
##  $ Total_Gross_Income                       : num  32104 9345 15658 10154 12165 ...
##  $ Has_Real_Assets                          : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Has_Financial_Assets                     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Has_Vehicles                             : chr  "Yes" "No" "Yes" "No" ...
##  $ Has_Valuables                            : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Value_of_Self_employment_Businesses      : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Has_Real_Estate_Wealth                   : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Has_Deposits                             : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Has_Mutual_Funds                         : chr  "No" "No" "No" "No" ...
##  $ Has_Bonds                                : chr  "No" "No" "No" "No" ...
##  $ Has_Shares                               : chr  "No" "No" "No" "No" ...
##  $ Has_Debt                                 : chr  "No" "No" "No" "No" ...
##  $ Has_Rental_Income                        : chr  "No" "No" "No" "No" ...
##  $ Housing_Status                           : chr  "Owner" "Owner" "Owner" "Owner" ...
##  $ Has_Employee_Income                      : chr  "No" "No" "No" "No" ...
##  $ Has_Self_Employee_Income                 : chr  "No" "No" "No" "No" ...
##  $ Has_Financial_assets_Income              : chr  "Yes" "No" "Yes" "Yes" ...
##  $ Has_Income_From_Pensions                 : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Income_From_Other_Sources                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Has_Income_From_Other_Sources            : chr  "No" "No" "No" "No" ...
##  $ Credit_Card_Debt                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Has_Credit_Card_Debt                     : chr  "No" "No" "No" "No" ...
##  $ Age                                      : Ord.factor w/ 6 levels "0-34"<"35-44"<..: 5 6 6 6 6 5 6 4 4 4 ...
##  $ Education_Level                          : chr  "Upper secondary" "Primary education" "Primary education" "Primary education" ...
##  $ Employment_status                        : chr  "Retired" "Other" "Retired" "Other" ...
##  $ Gender                                   : chr  "Male" "Female" "Female" "Female" ...
##  $ Way_Of_Acquring_Property                 : chr  "Purchased" "Purchased" "Inherited" "Own construction" ...
##  $ Monthly_Amount_Paid_As_Rent              : num  0 0 0 0 0 0 318 118 0 400 ...
##  $ Ownership_of_Cars                        : chr  "Yes" "No" "Yes" "No" ...
##  $ Total_Value_of_Cars                      : num  10000 0 1000 0 0 1000 0 0 8500 0 ...
##  $ Has_Other_Vehicles                       : chr  "No" "No" "No" "No" ...
##  $ Value_Of_Other_Vehicles                  : num  0 0 0 0 0 0 0 200 0 0 ...
##  $ Ownership_Of_Other_Valuables             : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Value_Of_Other_Valuables                 : num  1000 500 2000 200 1000 0 3000 5000 500 200 ...
##  $ Household_Has_a_Credit_Card              : chr  "No" "No" "No" "No" ...
##  $ Has_Private_Loans                        : chr  "No" "No" "No" "No" ...
##  $ No_of_PrivateLoans                       : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Has_Applied_for_Loan_Credit              : chr  "No" "No" "No" "No" ...
##  $ Household_Owns_Saving_accounts           : chr  "No" "No" "Yes" "No" ...
##  $ Value_of_Saving_Accounts                 : num  0 0 600 0 4214 ...
##  $ Investment_Attitudes                     : chr  "Take above average financial risks" "Not willing to take any financial risk" "Take average financial risks" "Not willing to take any financial risk" ...
##  $ Amount_spent_on_Food_at_Home             : int  300 300 400 350 250 300 200 250 600 250 ...
##  $ Amount_Spent_on_Food_Outside_Home        : int  200 0 30 0 50 0 0 50 100 0 ...
##  $ AMount_Spent_on_Utilities                : num  250 16.7 150 91.7 83.3 ...
##  $ Amount_Spent_on_Consumer_Goods_Services  : int  1000 790 1000 800 750 800 400 600 2500 400 ...
Data summary
Name hcfs
Number of rows 8156
Number of columns 61
_______________________
Column type frequency:
character 32
factor 1
numeric 28
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 14 14 0 8156 0
Household_Type 0 1 20 48 0 10 0
Has_Real_Assets 0 1 2 3 0 2 0
Has_Financial_Assets 0 1 2 3 0 2 0
Has_Vehicles 0 1 2 3 0 2 0
Has_Valuables 0 1 2 3 0 2 0
Has_Real_Estate_Wealth 0 1 2 3 0 2 0
Has_Deposits 0 1 2 3 0 2 0
Has_Mutual_Funds 0 1 2 3 0 2 0
Has_Bonds 0 1 2 3 0 2 0
Has_Shares 0 1 2 3 0 2 0
Has_Debt 0 1 2 3 0 2 0
Has_Rental_Income 0 1 2 3 0 2 0
Housing_Status 0 1 5 19 0 3 0
Has_Employee_Income 0 1 2 3 0 2 0
Has_Self_Employee_Income 0 1 2 3 0 2 0
Has_Financial_assets_Income 0 1 2 3 0 2 0
Has_Income_From_Pensions 0 1 2 3 0 2 0
Has_Income_From_Other_Sources 0 1 2 3 0 2 0
Has_Credit_Card_Debt 0 1 2 3 0 2 0
Education_Level 0 1 15 20 0 4 0
Employment_status 0 1 5 13 0 5 0
Gender 0 1 4 6 0 2 0
Way_Of_Acquring_Property 0 1 4 16 0 4 0
Ownership_of_Cars 0 1 2 3 0 2 0
Has_Other_Vehicles 0 1 2 3 0 2 0
Ownership_Of_Other_Valuables 0 1 2 3 0 2 0
Household_Has_a_Credit_Card 0 1 2 3 0 2 0
Has_Private_Loans 0 1 2 3 0 2 0
Has_Applied_for_Loan_Credit 0 1 2 3 0 2 0
Household_Owns_Saving_accounts 0 1 2 3 0 2 0
Investment_Attitudes 0 1 28 38 0 4 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Age 0 1 TRUE 6 75+: 1917, 65-: 1687, 55-: 1679, 45-: 1602

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Number_of_Household_Members 0 1 2.37 1.25 1 1.00 2.00 3.00 10.00 ▇▅▁▁▁
Number_of_Household_Members_in_Employment 0 1 0.77 0.86 0 0.00 1.00 1.00 5.00 ▇▂▁▁▁
Value_of_Household_Vehicles 0 1 6361.55 9529.77 0 500.00 4000.00 9000.00 275000.00 ▇▁▁▁▁
Valuables 0 1 3741.27 16739.98 0 500.00 1500.00 3500.00 1000000.00 ▇▁▁▁▁
Deposits 0 1 14581.40 48904.24 0 1000.00 5000.00 13500.00 2000000.00 ▇▁▁▁▁
Mutual_Funds 0 1 3472.32 27768.04 0 0.00 0.00 0.00 1000001.00 ▇▁▁▁▁
Bonds 0 1 6896.40 41288.50 0 0.00 0.00 0.00 2145209.43 ▇▁▁▁▁
Employee_Income 0 1 14168.89 22396.91 0 0.00 0.00 22949.66 379084.70 ▇▁▁▁▁
Self_Employment_income 0 1 5445.80 22072.92 0 0.00 0.00 0.00 761181.41 ▇▁▁▁▁
Rental_Income 0 1 451.93 3214.03 0 0.00 0.00 0.00 120000.00 ▇▁▁▁▁
Financial_assets_Income 0 1 399.98 1775.43 0 9.10 53.32 251.05 79133.90 ▇▁▁▁▁
Pension_Income 0 1 12935.45 16120.91 0 0.00 8923.77 20374.16 128532.26 ▇▂▁▁▁
Total_Real_Assets 0 1 221416.25 345197.20 0 29225.00 155500.00 272212.50 13600000.00 ▇▁▁▁▁
Total_Financial_Assets 0 1 30715.41 113895.15 0 1352.10 6819.18 26590.27 5143000.00 ▇▁▁▁▁
Total_Gross_Income 0 1 24905.01 11758.10 0 15379.55 25396.86 33734.56 49987.81 ▃▆▅▇▂
Value_of_Self_employment_Businesses 0 1 0.15 0.35 0 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Income_From_Other_Sources 0 1 127.14 1349.77 0 0.00 0.00 0.00 53295.00 ▇▁▁▁▁
Credit_Card_Debt 0 1 12.35 200.93 0 0.00 0.00 0.00 7000.00 ▇▁▁▁▁
Monthly_Amount_Paid_As_Rent 0 1 61.35 152.31 0 0.00 0.00 0.00 1300.00 ▇▁▁▁▁
Total_Value_of_Cars 0 1 5868.75 7749.18 0 100.00 4000.00 8000.00 150000.00 ▇▁▁▁▁
Value_Of_Other_Vehicles 0 1 492.80 4478.09 0 0.00 0.00 0.00 250000.00 ▇▁▁▁▁
Value_Of_Other_Valuables 0 1 3741.27 16739.98 0 500.00 1500.00 3500.00 1000000.00 ▇▁▁▁▁
No_of_PrivateLoans 0 1 0.03 0.17 0 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Value_of_Saving_Accounts 0 1 1058.32 2178.47 0 0.00 0.00 0.00 10000.00 ▇▁▂▁▁
Amount_spent_on_Food_at_Home 0 1 444.45 222.34 50 300.00 400.00 600.00 2000.00 ▇▆▁▁▁
Amount_Spent_on_Food_Outside_Home 0 1 82.59 123.93 0 0.00 50.00 100.00 3000.00 ▇▁▁▁▁
AMount_Spent_on_Utilities 0 1 171.88 134.30 0 83.33 166.67 250.00 2083.33 ▇▁▁▁▁
Amount_Spent_on_Consumer_Goods_Services 0 1 1232.54 714.39 100 750.00 1000.00 1500.00 10000.00 ▇▁▁▁▁

Cross tabulations

To get a quick overview of the distribution of the data and to identify any patterns or relationships that may exist, the following cross tabulations were drawn and the results have been discussed accordingly.

| Age Category/Gender | Female | Male | Total |
|---|---|---|---|
| 0-34 | 129 (4.4%) | 206 (3.9%) | 335 (4.1%) |
| 35-44 | 299 (10.3%) | 637 (12.1%) | 936 (11.5%) |
| 45-54 | 521 (17.9%) | 1,081 (20.6%) | 1,602 (19.6%) |
| 55-64 | 521 (17.9%) | 1,158 (22.0%) | 1,679 (20.6%) |
| 65-74 | 531 (18.3%) | 1,156 (22.0%) | 1,687 (20.7%) |
| 75+ | 902 (31.1%) | 1,015 (19.3%) | 1,917 (23.5%) |

Table 1 : Age vs Gender

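The table displays the number and percentage of reference persons by age group (0-34 to 75+) and gender; percentages are computed within each gender. Most reference persons are male (5,253 of 8,156). The largest age group overall is 75+ (23.5%); it is also the largest among females (31.1%), while among males the largest group is 55-64 (22.0%), just ahead of 65-74.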
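
The report does not show how the cross tabulations were generated. As a minimal sketch, assuming base R only, cells of the form "count (percent%)" with percentages computed within each gender can be built with `table()` and `prop.table()`. The data frame `df` below is toy data, not the survey data.

```r
# Toy data (not the survey data): age group and gender of eight reference persons
df <- data.frame(
  Age    = c("0-34", "0-34", "35-44", "35-44", "35-44", "35-44", "0-34", "35-44"),
  Gender = c("Female", "Male", "Male", "Female", "Male", "Male", "Male", "Female")
)

counts  <- table(df$Age, df$Gender)               # cell counts
col_pct <- prop.table(counts, margin = 2) * 100   # share within each gender

# Combine into "count (percent%)" cells, the layout used in Table 1
cells <- matrix(
  sprintf("%d (%.1f%%)", as.integer(counts), as.numeric(col_pct)),
  nrow = nrow(counts), dimnames = dimnames(counts)
)
print(cells)
```

Using `margin = 1` instead would give percentages within each age group, which answers a different question (the gender split of an age group rather than the age profile of a gender).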

| Level of Education/Gender | Female | Male | Total |
|---|---|---|---|
| First stage tertiary | 384 (13.2%) | 620 (11.8%) | 1,004 (12.3%) |
| Lower secondary | 644 (22.2%) | 1,685 (32.1%) | 2,329 (28.6%) |
| Primary education | 1,010 (34.8%) | 1,178 (22.4%) | 2,188 (26.8%) |
| Upper secondary | 865 (29.8%) | 1,770 (33.7%) | 2,635 (32.3%) |

Table 2 : Education level vs Gender

The table displays the distribution of education levels among females and males. The largest share of females completed primary education (34.8%), followed by upper secondary (29.8%), while the largest share of males completed upper secondary (33.7%), followed by lower secondary (32.1%).

| Employment_status/Gender | Female | Male | Total |
|---|---|---|---|
| Employee | 968 (33.3%) | 2,007 (38.2%) | 2,975 (36.5%) |
| Other | 798 (27.5%) | 127 (2.4%) | 925 (11.3%) |
| Retired | 898 (30.9%) | 2,260 (43.0%) | 3,158 (38.7%) |
| Self-employed | 163 (5.6%) | 703 (13.4%) | 866 (10.6%) |
| Unemployed | 76 (2.6%) | 156 (3.0%) | 232 (2.8%) |

Table 3 : Employment status vs Gender

The above table presents the distribution of employment status among female and male reference persons. A higher proportion of males than females are employees (38.2% vs. 33.3%), retired (43.0% vs. 30.9%), or self-employed (13.4% vs. 5.6%). Females are far more often in the Other category (27.5% vs. 2.4%). Only a small proportion of either group is unemployed (2.6% of females, 3.0% of males).

| Education_Level/Employment_status | Employee | Other | Retired | Self-employed | Unemployed | Total |
|---|---|---|---|---|---|---|
| First stage tertiary | 546 | 12 | 252 | 181 | 13 | 1,004 |
| Lower secondary | 939 | 198 | 792 | 283 | 117 | 2,329 |
| Primary education | 156 | 614 | 1,319 | 58 | 41 | 2,188 |
| Upper secondary | 1,334 | 101 | 795 | 344 | 61 | 2,635 |

Table 4 : Education Level vs Employment Status

The table presents the cross-tabulation of education level and employment status. The largest education category is Upper secondary (2,635), and the largest employment category is Retired (3,158 across all education levels), followed by Employee (2,975). The smallest cell is First stage tertiary with status Other (12), closely followed by First stage tertiary and Unemployed (13). Within First stage tertiary, Lower secondary, and Upper secondary, Employee is the most common status (546, 939, and 1,334, respectively), whereas within Primary education it is Retired (1,319).
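
Because the education categories differ in size, marginal totals and within-row percentages make Table 4 easier to read. The sketch below copies the counts from Table 4 into a matrix and derives both; the object names (`tab4`, `row_share`, and so on) are illustrative and not taken from the report's code.

```r
# Counts copied from Table 4 (rows: education level, columns: employment status)
tab4 <- matrix(
  c( 546,  12,  252, 181,  13,
     939, 198,  792, 283, 117,
     156, 614, 1319,  58,  41,
    1334, 101,  795, 344,  61),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Education = c("First stage tertiary", "Lower secondary",
                  "Primary education", "Upper secondary"),
    Status    = c("Employee", "Other", "Retired", "Self-employed", "Unemployed")
  )
)

row_totals <- rowSums(tab4)   # households per education level
col_totals <- colSums(tab4)   # households per employment status
row_share  <- round(prop.table(tab4, margin = 1) * 100, 1)  # % within education level

print(col_totals)
print(row_share)
```

With these counts, about 54% of reference persons with first stage tertiary education are employees, while about 60% of those with primary education are retired.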

| Investment Attitude/Gender | Female | Male | Total |
|---|---|---|---|
| Not willing to take any financial risk | 1,942 (66.9%) | 2,881 (54.8%) | 4,823 (59.1%) |
| Take above average financial risks | 247 (8.5%) | 644 (12.3%) | 891 (10.9%) |
| Take average financial risks | 708 (24.4%) | 1,683 (32.0%) | 2,391 (29.3%) |
| Take substantial financial risks | 6 (0.2%) | 45 (0.9%) | 51 (0.6%) |

Table 5 : Investment Attitude by Gender

The cross-tabulation presents the distribution of investment attitude by gender. The majority of respondents, both female (66.9%) and male (54.8%), were not willing to take any financial risk. Males were more likely than females to accept average financial risks (32.0% vs. 24.4%) and above-average financial risks (12.3% vs. 8.5%). Only a negligible share of females (0.2%) and males (0.9%) were willing to take substantial financial risks. Overall, the results suggest that both groups were generally risk-averse, females more so than males.

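| Number of Household Members/Number of Household Members in Employment | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1 | 1,732 | 662 | 0 | 0 | 0 | 0 |
| 2 | 1,637 | 667 | 284 | 0 | 0 | 0 |
| 3 | 296 | 646 | 505 | 53 | 0 | 0 |
| 4 | 124 | 450 | 537 | 94 | 17 | 0 |
| 5 | 43 | 141 | 111 | 34 | 9 | 2 |
| 6 | 10 | 39 | 20 | 10 | 5 | 1 |
| 7 | 6 | 10 | 2 | 3 | 0 | 0 |
| 8 | 2 | 1 | 1 | 0 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 1 | 0 | 0 | 0 |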

Table 6 : Frequency of households by the number of household members vs number of household members in employment

The table shows the frequency of households by the number of household members and the number of household members in employment. The largest group is households with no member in employment (3,850 of 8,156, about 47%), and the frequency of households falls as the number of employed members rises. The most frequent combinations are one-member households with no one in employment (1,732) and two-member households with no one in employment (1,637). Among households with at least one employed member, the most common combinations are two members with one in employment (667), one member with one in employment (662), and three members with one in employment (646).
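
The column totals and the largest cell of Table 6 can be checked directly from the counts. The sketch below copies the counts from Table 6 into a matrix; the object names are illustrative and not taken from the report's code.

```r
# Counts copied from Table 6
# rows: number of household members (1..10), columns: members in employment (0..5)
tab6 <- matrix(
  c(1732, 662,   0,  0,  0, 0,
    1637, 667, 284,  0,  0, 0,
     296, 646, 505, 53,  0, 0,
     124, 450, 537, 94, 17, 0,
      43, 141, 111, 34,  9, 2,
      10,  39,  20, 10,  5, 1,
       6,  10,   2,  3,  0, 0,
       2,   1,   1,  0,  0, 0,
       0,   1,   0,  0,  0, 0,
       0,   0,   1,  0,  0, 0),
  nrow = 10, byrow = TRUE,
  dimnames = list(Members = 1:10, Employed = 0:5)
)

employed_totals <- colSums(tab6)                        # households by number employed
share_none <- employed_totals[["0"]] / sum(tab6) * 100  # % with nobody in employment
largest <- which(tab6 == max(tab6), arr.ind = TRUE)     # position of the largest cell

print(employed_totals)
print(round(share_none, 1))
print(largest)
```

The cell counts sum to 8,156, which matches the number of rows in the cleaned dataset.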

| Gender | N | Employement | Self | Rental | Financial | Pension | Total_Gross_Income |
|---|---|---|---|---|---|---|---|
| Female | 2,903 | 10,287.83 | 3,407.447 | 334.9150 | 283.8476 | 11,585.23 | 20,823.63 |
| Male | 5,253 | 16,313.70 | 6,572.261 | 516.6009 | 464.1626 | 13,681.64 | 27,160.53 |

Table 7 : Gender and their Mean Income

The table reports mean household income by source, grouped by the gender of the household head. There are 2,903 households with a female head and 5,253 households with a male head. On average, male-headed households have a higher total gross income (27,160.53 vs. 20,823.63) and a higher mean income from every source: employment, self-employment, rental, financial assets, and pensions. The gap may reflect differences in education, work experience, and job opportunities, but the table alone does not support any definite conclusion.
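
The code behind this table is not shown in the report. A minimal sketch of a grouped-mean table of this kind follows, assuming base R's `aggregate()` rather than the dplyr pipeline used in the cleaning code. The data frame `toy` contains made-up values; only its column names mirror those in the report.

```r
# Toy data (not the survey data); column names mirror those used in the report
toy <- data.frame(
  Gender             = c("Female", "Female", "Male", "Male", "Male"),
  Employee_Income    = c(10000, 0, 20000, 0, 25000),
  Pension_Income     = c(0, 12000, 0, 15000, 0),
  Total_Gross_Income = c(10000, 12000, 20000, 15000, 25000)
)

# Mean of each income column by gender of the household head
means <- aggregate(
  cbind(Employee_Income, Pension_Income, Total_Gross_Income) ~ Gender,
  data = toy, FUN = mean
)

# Number of households per gender, aligned with the rows of `means`
means$N <- as.vector(table(toy$Gender)[as.character(means$Gender)])

print(means)
```

Means of this kind include households with zero income from a source, so they describe the average over all households in a group rather than the average among recipients.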

| Gender | N | Food | Consumer_Goods | Utilities |
|---|---|---|---|---|
| Female | 2,903 | 374.9983 | 1,035.087 | 151.5144 |
| Male | 5,253 | 482.8249 | 1,341.658 | 183.1321 |

Table 8 : Gender and their expenditure

The table presents the average expenditures on food, consumer goods, and utilities for households headed by females and by males. Male-headed households have higher average expenditures in all categories: about 107.83 more on food (482.82 vs. 375.00), 306.57 more on consumer goods (1,341.66 vs. 1,035.09), and 31.62 more on utilities (183.13 vs. 151.51).
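
The gap between the two groups is the difference between the two rows of Table 8, not the male mean itself. A short worked calculation using the means copied from Table 8 (the vector names are illustrative):

```r
# Mean expenditures copied from Table 8
female <- c(Food = 374.9983, Consumer_Goods = 1035.087, Utilities = 151.5144)
male   <- c(Food = 482.8249, Consumer_Goods = 1341.658, Utilities = 183.1321)

abs_diff <- male - female               # absolute gap (male minus female)
rel_diff <- (male / female - 1) * 100   # gap as a percentage of the female mean

print(round(abs_diff, 2))
print(round(rel_diff, 1))
```

Expressing the gap relative to the female mean puts the three categories on a comparable scale, since consumer goods spending is several times larger than utilities spending in absolute terms.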

| Household_Type | N | Food | Consumer_Goods | Utilities |
|---|---|---|---|---|
| One adult, 65 years and over | 1,494 | 294.4311 | 820.3313 | 138.4351 |
| One adult, younger than 65 years | 900 | 264.2100 | 833.2056 | 123.4713 |
| Single parent with dependent children | 222 | 359.8649 | 971.9144 | 149.6734 |
| Three or more adults | 1,052 | 582.5095 | 1,572.0608 | 204.5954 |
| Three or more adults with dependent children | 452 | 597.4115 | 1,571.5044 | 192.7240 |
| Two adults with one dependent child | 653 | 492.7259 | 1,431.0628 | 192.0776 |
| Two adults with three or more dependent children | 198 | 604.0404 | 1,475.7576 | 181.7508 |
| Two adults with two dependent children | 727 | 549.4498 | 1,551.8982 | 202.6058 |
| Two adults younger than 65 years | 714 | 437.1429 | 1,273.7759 | 175.5048 |
| Two adults, at least one aged 65 years and over | 1,744 | 476.8291 | 1,280.2993 | 180.2190 |

Table 9 : Household Type and their expenditure

The table shows, for each Household_Type in the hcfs dataset, the number of observations (N) and the mean expenditure in three categories: Food, Consumer_Goods, and Utilities. The category with the largest number of observations is “Two adults, at least one aged 65 years and over” with 1,744 observations, while the category with the smallest number is “Two adults with three or more dependent children” with only 198 observations. Mean expenditure in all three categories is lowest for the two one-adult household types; the highest means occur among households with three or more adults (Consumer_Goods and Utilities) and two adults with three or more dependent children (Food).


Gender  Investment_Attitudes                    N      Employement  Self        Rental      Financial   Pension     Total_Gross_Income
Female  Not willing to take any financial risk  1,942  8,501.438    1,875.688   146.1869    185.6105    11,414.420  19,047.92
Female  Take above average financial risks      247    13,064.831   6,729.370   559.6761    415.9566    9,573.682   21,380.09
Female  Take average financial risks            708    14,084.207   6,416.304   774.4678    500.9052    12,774.641  25,397.86
Female  Take substantial financial risks        6      26,189.951   7,389.354   300.0000    1,028.6376  9,329.190   32,893.97
Male    Not willing to take any financial risk  2,881  13,668.282   4,721.677   284.1855    239.1875    13,319.018  25,885.58
Male    Take above average financial risks      644    18,437.799   9,590.465   756.4887    783.5533    13,569.556  27,321.31
Male    Take average financial risks            1,683  19,980.768   8,394.089   787.3876    697.6551    14,440.865  29,313.07
Male    Take substantial financial risks        45     18,132.896   13,720.563  1,835.8667  1,564.1136  10,106.111  25,978.98

Table 10 : Investment Attitude with Gender and their mean Income

The table presents investment attitudes and total gross income of males and females in four categories of investment attitudes (Not willing to take any financial risk, Take above average financial risks, Take average financial risks, and Take substantial financial risks). From the table, it can be seen that:

Male-headed households have a higher mean total gross income than female-headed households in three of the four investment-attitude categories. The exception is “Take substantial financial risks”, where the female mean (32,893.97) exceeds the male mean (25,978.98), although that female group contains only 6 households. Within each gender, households that are not willing to take any financial risk have the lowest mean total gross income. Among male-headed households the highest mean is in the “Take average financial risks” category (29,313.07), while among female-headed households it is in the “Take substantial financial risks” category.

Housing_Status/Gender  Female         Male           Total
Owner                  1,795 (61.8%)  3,436 (65.4%)  5,231 (64.1%)
Owner with mortgage    166 (5.7%)     478 (9.1%)     644 (7.9%)
Renter                 942 (32.4%)    1,339 (25.5%)  2,281 (28.0%)

Table 11 : Gender and Housing status

From the cross-tabulation of gender and housing status, the majority of households are owners (64.1%), followed by renters (28.0%) and owners with a mortgage (7.9%). Female-headed households are more likely to rent (32.4% versus 25.5%), while male-headed households are more likely to be owners (65.4% versus 61.8%) or owners with a mortgage (9.1% versus 5.7%). Turning to investment attitudes, the counts in Table 10 imply that a larger share of female-headed households is not willing to take any financial risk (1,942 of 2,903, about 66.9%) than of male-headed households (2,881 of 5,253, about 54.8%), whereas a larger share of male-headed households takes above-average financial risks (about 12.3% versus 8.5%).
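The percentages in Table 11 are shares within each gender column, which can be checked from the printed counts. The sketch below rebuilds the counts by hand; `counts`, `col_pct`, and `total_pct` are names chosen here, not objects from the report.

```r
# Counts copied from Table 11 (rows: housing status, columns: gender)
counts <- matrix(
  c(1795, 166, 942,    # Female
    3436, 478, 1339),  # Male
  nrow = 3,
  dimnames = list(
    Housing_Status = c("Owner", "Owner with mortgage", "Renter"),
    Gender         = c("Female", "Male")
  )
)

# Share of each housing status within each gender (every column sums to 100)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)

# Share of each housing status over all households
total_pct <- round(100 * rowSums(counts) / sum(counts), 1)

print(col_pct)
print(total_pct)
```

Using `margin = 1` instead would give the gender split within each housing status, which answers a different question.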

Has_Private_Loans/Education_Level  First stage tertiary  Lower secondary  Primary education  Upper secondary  Total
No                                 983 (97.9%)           2,228 (95.7%)    2,136 (97.6%)      2,572 (97.6%)    7,919 (97.1%)
Yes                                21 (2.1%)             101 (4.3%)       52 (2.4%)          63 (2.4%)        237 (2.9%)

Table 12 : Private loans by Education Level

The table represents the distribution of individuals based on their education level and whether they have private loans. It can be observed that a vast majority of individuals with different education levels do not have private loans. The highest percentage of individuals without private loans was observed among those with a first stage tertiary education (97.9%). On the other hand, the highest percentage of individuals with private loans was observed among those with lower secondary education (4.3%).


Gender  Has_Credit_Card_Debt  N      Debt
Female  No                    2,891  0.000
Female  Yes                   12     1,155.833
Male    No                    5,201  0.000
Male    Yes                   52     1,669.542

Table 13 : Who has credit card debt?

This table shows the relationship between gender and having credit card debt, as well as the amount of debt for those who have it. Among females, 2,891 have no credit card debt, while 12 have an average debt of 1,155.833. Among males, 5,201 have no credit card debt, while 52 have an average debt of 1,669.542.


Gender  Education_Level       Age    N    Savings      Vehicles     Business
Female  First stage tertiary  0-34   38   1,504.18504  5,643.4211   0.15789474
Female  First stage tertiary  35-44  68   1,616.13056  9,181.4706   0.26470588
Female  First stage tertiary  45-54  104  1,014.89171  9,559.3846   0.29807692
Female  First stage tertiary  55-64  101  1,356.14387  10,817.3960  0.17821782
Female  First stage tertiary  65-74  48   1,547.08783  5,378.1250   0.06250000
Female  First stage tertiary  75+    25   1,579.19583  4,164.0000   0.04000000
Female  Lower secondary       0-34   30   881.47529    2,153.3333   0.16666667
Female  Lower secondary       35-44  83   292.73803    6,420.4819   0.12048193
Female  Lower secondary       45-54  143  581.13098    5,044.3357   0.18881119
Female  Lower secondary       55-64  145  1,355.99080  5,226.6207   0.08965517
Female  Lower secondary       65-74  142  1,272.99020  2,409.8592   0.02816901
Female  Lower secondary       75+    101  1,283.24751  1,251.0891   0.02970297
Female  Primary education     0-34   2    0.00000      1,790.0000   0.00000000
Female  Primary education     35-44  9    473.13545    2,088.8889   0.00000000
Female  Primary education     45-54  30   225.63807    1,916.6667   0.10000000
Female  Primary education     55-64  80   850.58586    3,589.7500   0.03750000
Female  Primary education     65-74  210  892.45802    1,299.5333   0.02380952
Female  Primary education     75+    679  1,274.00131  428.7069     0.00736377
Female  Upper secondary       0-34   59   484.85266    3,692.5254   0.11864407
Female  Upper secondary       35-44  139  1,043.91575  6,704.9640   0.15107914
Female  Upper secondary       45-54  244  815.20489    6,512.8689   0.15983607
Female  Upper secondary       55-64  195  1,169.10265  7,463.3846   0.16410256
Female  Upper secondary       65-74  131  1,117.80542  3,968.8550   0.05343511
Female  Upper secondary       75+    97   1,241.76545  1,326.7010   0.02061856
Male    First stage tertiary  0-34   38   657.70545    8,897.3684   0.18421053
Male    First stage tertiary  35-44  106  1,190.48946  10,458.3019  0.36792453
Male    First stage tertiary  45-54  136  1,305.27686  13,926.1029  0.42647059
Male    First stage tertiary  55-64  148  1,040.88955  13,406.3514  0.34459459
Male    First stage tertiary  65-74  129  1,452.11997  11,596.1938  0.25581395
Male    First stage tertiary  75+    63   1,020.31785  7,983.3333   0.15873016
Male    Lower secondary       0-34   68   130.15139    4,734.1176   0.08823529
Male    Lower secondary       35-44  227  747.03433    6,462.4141   0.17621145
Male    Lower secondary       45-54  435  806.93091    7,964.8276   0.22988506
Male    Lower secondary       55-64  434  836.39428    7,612.5783   0.24193548
Male    Lower secondary       65-74  314  1,307.01788  5,931.9427   0.06687898
Male    Lower secondary       75+    207  1,332.23324  3,662.2754   0.03381643
Male    Primary education     0-34   6    0.00000      2,841.6667   0.16666667
Male    Primary education     35-44  17   65.19703     2,070.5882   0.11764706
Male    Primary education     45-54  49   136.57966    3,440.0000   0.12244898
Male    Primary education     55-64  158  1,064.51704  6,178.7975   0.17088608
Male    Primary education     65-74  378  1,156.10350  5,374.2857   0.10582011
Male    Primary education     75+    570  1,357.05538  2,944.0982   0.01929825
Male    Upper secondary       0-34   94   822.45969    7,127.9787   0.17021277
Male    Upper secondary       35-44  287  1,023.07805  8,768.8850   0.16376307
Male    Upper secondary       45-54  461  1,027.79348  12,224.7722  0.29501085
Male    Upper secondary       55-64  418  840.13717    9,359.5024   0.23923445
Male    Upper secondary       65-74  335  1,188.00964  9,153.1970   0.16119403
Male    Upper secondary       75+    175  1,000.65858  6,104.9143   0.08571429

Table 14 : Who has savings, vehicles, and business ownership?

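The table reports, for each combination of gender, education level, and age group, the number of households (N) and the group means of Savings, Vehicles, and Business. Male-headed households have higher mean vehicle values in nearly every cell and a higher Business mean in most cells. The savings comparison is mixed: among the tertiary-educated, for example, female-headed households report higher mean savings in five of the six age groups. Vehicle values and Business means are generally higher for the tertiary and upper secondary groups than for the primary education group. By age, vehicle values and Business means generally peak between 35 and 64 and fall after 65, whereas mean savings are often highest in the groups aged 65 and over, particularly at lower education levels. Several cells contain very few households (for example, N = 2 and N = 6 for primary education, age 0-34), so their means should be read with caution.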
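The three-way grouping behind a table like this can be sketched in base R with `cut()` and `aggregate()`. Everything in the block is illustrative: the data are invented, `Age_Years` is a hypothetical raw age column, and the bin edges are an assumption inferred from the age labels in Table 14.

```r
toy <- data.frame(
  Gender          = c("Female", "Male", "Male", "Female", "Male", "Female"),
  Education_Level = c("Upper secondary", "Upper secondary", "Primary education",
                      "Primary education", "Upper secondary", "Upper secondary"),
  Age_Years       = c(29, 41, 78, 66, 44, 34),   # hypothetical raw age column
  Savings         = c(500, 1200, 1300, 900, 800, 700)
)

# Right-closed bins: (-Inf,34], (34,44], (44,54], (54,64], (64,74], (74,Inf)
toy$Age <- cut(
  toy$Age_Years,
  breaks = c(-Inf, 34, 44, 54, 64, 74, Inf),
  labels = c("0-34", "35-44", "45-54", "55-64", "65-74", "75+")
)

# Mean savings in every Gender x Education_Level x Age cell that occurs in the data
cell_means <- aggregate(Savings ~ Gender + Education_Level + Age, data = toy, FUN = mean)
print(cell_means)
```

Cells with no observations are simply absent from the result, which is why small-sample cells deserve attention when the grouping is this fine.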

Data Visualization

Graphs

The following graphs for the data were drawn to visualize the dataset and obtain insights.

Q-Q Plots

To obtain a series of QQ (Quantile-Quantile) plots for the data we considered different subsets of data. Subset 1 includes income-related variables, subset 2 includes variables related to household expenses, and subset 3 includes variables related to household assets. For each subset, a loop is used to create multiple QQ plots, one for each variable in the subset. The resulting QQ plots were used to check whether the data follow a normal distribution.
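A loop of the kind described above might look like the following sketch. The data are simulated, and the contents of `income_vars` are an assumption; the report does not list which columns belong to each subset.

```r
set.seed(1)
# Toy data; the column names mirror the report, the values are simulated
toy <- data.frame(
  Total_Gross_Income = rnorm(200, mean = 25000, sd = 8000),
  Rental_Income      = rexp(200, rate = 1 / 400)
)
income_vars <- c("Total_Gross_Income", "Rental_Income")  # assumed subset 1

# One panel per variable, drawn side by side
op <- par(mfrow = c(1, length(income_vars)))
for (v in income_vars) {
  qqnorm(toy[[v]], main = v)     # sample quantiles against normal quantiles
  qqline(toy[[v]], col = "red")  # reference line through the quartiles
}
par(op)  # restore the previous plotting layout
```

Points that follow the red line suggest approximate normality, while a curve that bends away in the upper tail is typical of right-skewed income variables.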

Tiles

Each ggplot tile graph displays the relationship between two variables, with the color of each tile representing the value of a third variable: Total_Gross_Income, Amount_Spent_on_Consumer_Goods_Services, or Investment_Attitudes.

Pie Charts

# Create a frequency table of education levels
edu_table <- table(hcfs$Education_Level)

# Plot a pie chart labelled with each category and its percentage share
pie(edu_table, main = "Education Level", labels = paste(names(edu_table), " (", round(100*edu_table/sum(edu_table),1), "%)", sep = ""))

# Create a frequency table of gender
gender_table <- table(hcfs$Gender)

# Plot a pie chart labelled with each category and its percentage share
pie(gender_table, main = "Gender", labels = paste(names(gender_table), " (", round(100*gender_table/sum(gender_table),1), "%)", sep = ""))

# Create a frequency table of investment attitudes
invest_table <- table(hcfs$Investment_Attitudes)

# Plot a pie chart labelled with each category and its percentage share
pie(invest_table, main = "Investment Attitudes", labels = paste(names(invest_table), " (", round(100*invest_table/sum(invest_table),1), "%)", sep = ""))

# Create a frequency table of employment status
emp_table <- table(hcfs$Employment_status)

# Plot a pie chart labelled with each category and its percentage share
pie(emp_table, main = "Employment Status", labels = paste(names(emp_table), " (", round(100*emp_table/sum(emp_table),1), "%)", sep = ""))

Skewness

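To understand the shape and center of the distributions, we computed the skewness of the main numeric columns and plotted histograms to visualize them. In the output below, Total_Gross_Income is nearly symmetric (skewness about -0.09), while every other variable is right-skewed, with values above 10 for Self_Employment_income, Rental_Income, Financial_assets_Income, Credit_Card_Debt, and Income_From_Other_Sources.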

##                      Total_Gross_Income               AMount_Spent_on_Utilities 
##                             -0.08913195                              2.50346726 
## Amount_Spent_on_Consumer_Goods_Services                         Employee_Income 
##                              2.28148430                              2.99082831 
##                  Self_Employment_income                 Financial_assets_Income 
##                             10.72115257                             21.42052488 
##     Value_of_Self_employment_Businesses                          Pension_Income 
##                              1.99882600                              1.90081580 
##            Amount_spent_on_Food_at_Home                           Rental_Income 
##                              1.32405321                             17.32616287 
##                        Credit_Card_Debt                Value_of_Saving_Accounts 
##                             22.80546422                              2.12286317 
##               Income_From_Other_Sources 
##                             24.21245486
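Skewness values like those above can be obtained with a short moment-based function. The sketch below is an assumption about the method: the report does not say which package or formula it used, and packaged implementations may apply a small-sample correction that this version omits. The data are simulated.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with m_k the k-th central moment
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
toy <- data.frame(
  symmetric    = rnorm(1000),          # skewness close to 0
  right_skewed = rexp(1000, rate = 1)  # skewness clearly positive
)

skews <- sapply(toy, skewness)
print(round(skews, 2))

# Histogram of one column with an explicit number of bins
hist(toy$right_skewed, breaks = 30, main = "right_skewed", xlab = "value")
```

A value near 0 indicates a roughly symmetric distribution, and large positive values indicate a long right tail, as is common for income components that are zero for most households.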

Correlation

The following scatter plots show the pairwise relationships between the numeric variables in the dataset. Three subsets of the dataset are considered, covering variables related to income, expenses, and assets, so that patterns and correlations can be identified visually.

To further visualize the relationships between the variables, a correlation matrix was computed and a correlation plot was drawn using the corrplot package.

Total_Gross_Income AMount_Spent_on_Utilities Amount_Spent_on_Consumer_Goods_Services Employee_Income Self_Employment_income Financial_assets_Income Rental_Income Pension_Income Credit_Card_Debt Value_of_Saving_Accounts
Total_Gross_Income 1.0000000 0.3377350 0.5648751 0.4186945 0.1826003 0.1475132 0.1049226 0.2712045 0.0395201 0.0858689
AMount_Spent_on_Utilities 0.3377350 1.0000000 0.4557343 0.1940775 0.1776377 0.1753224 0.1274115 0.1862575 0.0007820 0.0373934
Amount_Spent_on_Consumer_Goods_Services 0.5648751 0.4557343 1.0000000 0.4438192 0.2858661 0.2763094 0.1417755 0.2389673 0.0147197 0.0794036
Employee_Income 0.4186945 0.1940775 0.4438192 1.0000000 -0.0374893 0.1185655 0.0219828 -0.3372886 0.0296488 0.0241524
Self_Employment_income 0.1826003 0.1776377 0.2858661 -0.0374893 1.0000000 0.1339790 0.0725096 -0.0818505 0.0115530 0.0046256
Financial_assets_Income 0.1475132 0.1753224 0.2763094 0.1185655 0.1339790 1.0000000 0.3209045 0.1862423 -0.0114550 0.0277844
Rental_Income 0.1049226 0.1274115 0.1417755 0.0219828 0.0725096 0.3209045 1.0000000 0.0765708 -0.0086400 0.0323215
Pension_Income 0.2712045 0.1862575 0.2389673 -0.3372886 -0.0818505 0.1862423 0.0765708 1.0000000 -0.0231590 0.0713592
Credit_Card_Debt 0.0395201 0.0007820 0.0147197 0.0296488 0.0115530 -0.0114550 -0.0086400 -0.0231590 1.0000000 -0.0112304
Value_of_Saving_Accounts 0.0858689 0.0373934 0.0794036 0.0241524 0.0046256 0.0277844 0.0323215 0.0713592 -0.0112304 1.0000000

The correlation matrix shows the pairwise correlations between the variables in the dataset. The correlation coefficient ranges from -1 to 1, where a value of 1 indicates a perfect positive correlation, a value of 0 indicates no correlation, and a value of -1 indicates a perfect negative correlation.

  • Total Gross Income and Amount Spent on Consumer Goods and Services have a moderate positive correlation (0.56), as do Amount Spent on Utilities and Amount Spent on Consumer Goods and Services (0.46).

  • Employee Income and Pension Income are positively correlated with Total Gross Income (0.42 and 0.27, respectively), while Self Employment Income and Financial Assets Income have weaker positive correlations with Total Gross Income (0.18 and 0.15, respectively).

  • Credit Card Debt and Value of Saving Accounts have very weak positive correlations with Total Gross Income (0.04 and 0.09, respectively).
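A correlation matrix of this form, together with a simple way to pick out the larger coefficients, can be sketched as below. The data are simulated so that one pair is related and one variable is not, and the cut-off of 0.5 is an arbitrary choice made for this example rather than a threshold used in the report.

```r
set.seed(7)
n <- 300
income <- rnorm(n, 25000, 8000)
toy <- data.frame(
  Total_Gross_Income = income,
  Consumer_Goods     = 0.04 * income + rnorm(n, 0, 300),  # built to track income
  Credit_Card_Debt   = rexp(n, rate = 1 / 50)             # built to be unrelated
)

# Pearson correlation matrix, rounded for display
r <- round(cor(toy, method = "pearson"), 2)
print(r)

# List variable pairs whose absolute correlation exceeds a chosen cut-off
cutoff <- 0.5
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(strong_pairs)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal of ones. With heavily skewed variables, `method = "spearman"` is a common alternative to Pearson.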

Value_of_Household_Vehicles Valuables Deposits Mutual_Funds Bonds Value_of_Self_employment_Businesses Total_Real_Assets Income_From_Other_Sources Monthly_Amount_Paid_As_Rent Total_Value_of_Cars Value_Of_Other_Vehicles Value_Of_Other_Valuables No_of_PrivateLoans Amount_spent_on_Food_at_Home Amount_Spent_on_Food_Outside_Home
Value_of_Household_Vehicles 1.0000000 0.1386356 0.1951227 0.0925590 0.1498322 0.2718373 0.3400578 0.0011845 -0.0923150 0.8856921 0.5954281 0.1386356 -0.0468559 0.3098402 0.3050785
Valuables 0.1386356 1.0000000 0.2320160 0.0725267 0.0754184 0.0895533 0.4907918 0.0065577 -0.0507739 0.1194925 0.0882509 1.0000000 -0.0256304 0.1098660 0.1048881
Deposits 0.1951227 0.2320160 1.0000000 0.2019328 0.2248474 0.0871600 0.3522789 -0.0035471 -0.0706288 0.1468712 0.1610826 0.2320160 -0.0430908 0.1272698 0.1355410
Mutual_Funds 0.0925590 0.0725267 0.2019328 1.0000000 0.1058151 0.0569125 0.1766471 0.0213245 -0.0374539 0.1021025 0.0202889 0.0725267 -0.0195802 0.0837952 0.0979383
Bonds 0.1498322 0.0754184 0.2248474 0.1058151 1.0000000 0.0459544 0.1910334 0.0168866 -0.0466830 0.1419004 0.0733025 0.0754184 -0.0260860 0.0916031 0.1339986
Value_of_Self_employment_Businesses 0.2718373 0.0895533 0.0871600 0.0569125 0.0459544 1.0000000 0.2747753 -0.0004089 -0.0128197 0.2831667 0.0884834 0.0895533 0.0005682 0.1714202 0.2137338
Total_Real_Assets 0.3400578 0.4907918 0.3522789 0.1766471 0.1910334 0.2747753 1.0000000 -0.0033781 -0.2230694 0.3240399 0.1629330 0.4907918 -0.0583236 0.2750643 0.2300403
Income_From_Other_Sources 0.0011845 0.0065577 -0.0035471 0.0213245 0.0168866 -0.0004089 -0.0033781 1.0000000 0.0070449 0.0019005 -0.0007681 0.0065577 0.0364342 0.0299453 0.0133272
Monthly_Amount_Paid_As_Rent -0.0923150 -0.0507739 -0.0706288 -0.0374539 -0.0466830 -0.0128197 -0.2230694 0.0070449 1.0000000 -0.0954481 -0.0312848 -0.0507739 0.0602591 -0.0814419 -0.0425896
Total_Value_of_Cars 0.8856921 0.1194925 0.1468712 0.1021025 0.1419004 0.2831667 0.3240399 0.0019005 -0.0954481 1.0000000 0.1543650 0.1194925 -0.0520367 0.3301675 0.3357399
Value_Of_Other_Vehicles 0.5954281 0.0882509 0.1610826 0.0202889 0.0733025 0.0884834 0.1629330 -0.0007681 -0.0312848 0.1543650 1.0000000 0.0882509 -0.0096657 0.0880236 0.0682474
Value_Of_Other_Valuables 0.1386356 1.0000000 0.2320160 0.0725267 0.0754184 0.0895533 0.4907918 0.0065577 -0.0507739 0.1194925 0.0882509 1.0000000 -0.0256304 0.1098660 0.1048881
No_of_PrivateLoans -0.0468559 -0.0256304 -0.0430908 -0.0195802 -0.0260860 0.0005682 -0.0583236 0.0364342 0.0602591 -0.0520367 -0.0096657 -0.0256304 1.0000000 -0.0601121 -0.0359474
Amount_spent_on_Food_at_Home 0.3098402 0.1098660 0.1272698 0.0837952 0.0916031 0.1714202 0.2750643 0.0299453 -0.0814419 0.3301675 0.0880236 0.1098660 -0.0601121 1.0000000 0.3265420
Amount_Spent_on_Food_Outside_Home 0.3050785 0.1048881 0.1355410 0.0979383 0.1339986 0.2137338 0.2300403 0.0133272 -0.0425896 0.3357399 0.0682474 0.1048881 -0.0359474 0.3265420 1.0000000

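The correlation matrix suggests that Total Real Assets has a moderate positive correlation with Valuables (0.49) and weaker positive correlations with Total Value of Cars (0.32) and Value of Self-employment Businesses (0.27). The Monthly Amount Paid as Rent has a weak negative correlation with Total Real Assets (-0.22) and Value of Household Vehicles (-0.09). Two pairs are very strongly correlated: Valuables and Value Of Other Valuables have a correlation of 1.00 and identical rows, which suggests they carry the same information, and Value of Household Vehicles is closely tied to Total Value of Cars (0.89).

The other variables have low correlations with each other. For example, Income from Other Sources and No of Private Loans are almost uncorrelated with every other variable (absolute values of about 0.06 or less), while Deposits has a weak positive correlation with Valuables (0.23).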

Hypothesis Testing

Hypothesis Statements

To perform hypothesis testing on the dataset, the following questions were considered, each asking whether a dependent variable differs significantly between the groups of an independent variable. Based on the distributions and skewness values computed earlier, parametric tests were selected for the approximately symmetric Total_Gross_Income and non-parametric tests for the skewed variables.

Hypothesis Statements of Independent Variable Gender

S.No

Dependent Variable

Statistical Question

Null Hypothesis

Alternative Hypothesis

Test

1

Total_Gross_Income

Is there a significant difference in Total Gross income between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Total Gross Income.

There is a difference between the Male and Female groups with respect to the dependent variable Total Gross Income.

T test

2

AMount_Spent_on_Utilities

Does the Amount Spent on Utilities significantly differ between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Amount spent on utilities.

There is a difference between the Male and Female groups with respect to the dependent variable Amount spent on utilities.

Wilcoxon-Mann-Whitney test

3

Amount_Spent_on_Consumer_Goods_Services

Is there a significant difference in the amount spent on Consumer goods and services among males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Amount spent on Consumer goods and services.

There is a difference between the Male and Female groups with respect to the dependent variable Amount spent on Consumer goods and services.

Wilcoxon-Mann-Whitney test

4

Employee_Income

Does the dependent variable Employee income show a significant difference between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Employee_Income.

There is a difference between the Male and Female groups with respect to the dependent variable Employee_Income.

Wilcoxon-Mann-Whitney test

5

Self_Employment_income

Is there a significant difference in self-employment income between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Self_Employment_income.

There is a difference between the Male and Female groups with respect to the dependent variable Self_Employment_income.

Wilcoxon-Mann-Whitney test

6

Financial_assets_Income

Does the Financial asset income significantly differ between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Financial_assets_Income

There is a difference between the Male and Female groups with respect to the dependent variable Financial_assets_Income

Wilcoxon-Mann-Whitney test

7

Pension_Income

Is there a significant difference in the pension income earned between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Pension_Income.

There is a difference between the Male and Female groups with respect to the dependent variable Pension_Income.

Wilcoxon-Mann-Whitney test

8

Rental_Income

Does the Rental income earned significantly differ between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Rental_Income.

There is a difference between the Male and Female groups with respect to the dependent variable Rental_Income.

Wilcoxon-Mann-Whitney test

9

Credit_Card_Debt

Does the pattern of Credit Card debt significantly differ between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Credit_Card_Debt.

There is a difference between the Male and Female groups with respect to the dependent variable Credit_Card_Debt.

Wilcoxon-Mann-Whitney test

10

Value_of_Saving_Accounts

Is there a significant difference in the value of saving account held between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Value_of_Saving_Accounts.

There is a difference between the Male and Female groups with respect to the dependent variable Value_of_Saving_Accounts.

Wilcoxon-Mann-Whitney test

11

Value_of_Self_employment_Businesses

Is there a significant difference in the value of self-employment business held between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Value_of_Self_employment_Businesses.

There is a difference between the Male and Female groups with respect to the dependent variable Value_of_Self_employment_Businesses.

Wilcoxon-Mann-Whitney test

12

Amount_spent_on_Food_at_Home

Does the Amount Spent on Food at home significantly differ between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Amount_spent_on_Food_at_Home.

There is a difference between the Male and Female groups with respect to the dependent variable Amount_spent_on_Food_at_Home.

Wilcoxon-Mann-Whitney test

13

Income_From_Other_Sources

Is there a significant difference in the income earned from other sources between males and females?

There is no difference between the Male and Female groups with respect to the dependent variable Income_From_Other_Sources.

There is a difference between the Male and Female groups with respect to the dependent variable Income_From_Other_Sources.

Wilcoxon-Mann-Whitney test
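The two tests named in the table above can be run in base R as sketched below. The data are simulated, with a built-in income difference between the groups, so the printed p-values say nothing about the survey. Note that `t.test()` performs the Welch version by default, which does not assume equal variances; whether the report used that default is not stated.

```r
set.seed(123)
toy <- data.frame(
  Gender             = rep(c("Female", "Male"), each = 100),
  Total_Gross_Income = c(rnorm(100, 21000, 6000), rnorm(100, 27000, 6000)),
  Rental_Income      = c(rexp(100, 1 / 300), rexp(100, 1 / 500))
)

# Roughly symmetric variable: two-sample t-test (Welch, R's default)
t_res <- t.test(Total_Gross_Income ~ Gender, data = toy)

# Strongly skewed variable: Wilcoxon-Mann-Whitney rank-sum test
w_res <- wilcox.test(Rental_Income ~ Gender, data = toy)

alpha <- 0.05  # assumed significance level
cat("t-test p-value:  ", format.pval(t_res$p.value), "\n")
cat("Wilcoxon p-value:", format.pval(w_res$p.value), "\n")
cat("Reject H0 for Total_Gross_Income:", t_res$p.value < alpha, "\n")
```

The null hypothesis is rejected when the p-value falls below the chosen significance level. With thirteen tests on the same grouping variable, a multiple-comparison adjustment such as `p.adjust()` may be worth considering.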

Hypothesis Statements of Independent Variable Age

S.No

Dependent Variable

Statistical Question

Null Hypothesis

Alternative Hypothesis

Test

1

Total_Gross_Income

Is there a statistically significant difference in Total Gross Income across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Total_Gross_Income.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Total_Gross_Income.

ANOVA

2

AMount_Spent_on_Utilities

Is there a statistically significant change in the amount spent on utilities across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Amount_Spent_on_Utilities.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_Spent_on_Utilities.

Kruskal-Wallis rank sum test

3

Amount_Spent_on_Consumer_Goods_Services

Does the amount spent on consumer goods & services show statistically significant variation with age?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services.

Kruskal-Wallis rank sum test

4

Employee_Income

Is there a statistically significant difference in employee income across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Employee_Income.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Employee_Income.

Kruskal-Wallis rank sum test

5

Self_Employment_income

Is there a statistically significant difference in self-employment income between different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Self_Employment_income.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Self_Employment_income.

Kruskal-Wallis rank sum test

6

Financial_assets_Income

Does the Financial assets income show statistically significant difference across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Financial_assets_Income.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Financial_assets_Income.

Kruskal-Wallis rank sum test

7

Rental_Income

Is there a statistically significant change in rental income across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Rental_Income.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Rental_Income.

Kruskal-Wallis rank sum test

8

Credit_Card_Debt

Does Credit Card debt show statistically significant difference across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Credit_Card_Debt.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Credit_Card_Debt.

Kruskal-Wallis rank sum test

9

Value_of_Saving_Accounts

Is there a statistically significant difference in the value of savings account among different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Value_of_Saving_Accounts.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Value_of_Saving_Accounts.

Kruskal-Wallis rank sum test

10

Value_of_Self_employment_Businesses

Is there a statistically significant difference in the value of self-employment businesses among different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Value_of_Self_employment_Businesses.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Value_of_Self_employment_Businesses.

Kruskal-Wallis rank sum test

11

Amount_spent_on_Food_at_Home

Is there a statistically significant change in the amount spent on food across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Amount_spent_on_Food_at_Home.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_spent_on_Food_at_Home.

Kruskal-Wallis rank sum test

12

Income_From_Other_Sources

Is there a statistically significant difference in Income from other sources across different age groups?

There is no difference between the 6 categories of the independent variable Age
with respect to the dependent variable Income_From_Other_Sources.

There is a difference between the 6 categories of the independent variable Age with respect to the dependent variable Income_From_Other_Sources.

Kruskal-Wallis rank sum test
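For a grouping variable with more than two categories, the ANOVA and Kruskal-Wallis tests listed above can be sketched as follows. The data are simulated: income means are made to differ across the six age groups, while the skewed variable has the same distribution in every group, so the results illustrate the mechanics only.

```r
set.seed(99)
ages <- c("0-34", "35-44", "45-54", "55-64", "65-74", "75+")
toy <- data.frame(
  Age = factor(rep(ages, each = 50), levels = ages),
  Total_Gross_Income = rnorm(300,
                             mean = rep(c(20, 26, 30, 29, 24, 19) * 1000, each = 50),
                             sd   = 5000),
  Rental_Income      = rexp(300, rate = 1 / 400)  # same distribution in every group
)

# One-way ANOVA for the roughly symmetric variable
anova_fit <- aov(Total_Gross_Income ~ Age, data = toy)
anova_p   <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Kruskal-Wallis rank sum test for the skewed variable
kw_res <- kruskal.test(Rental_Income ~ Age, data = toy)

cat("ANOVA p-value:         ", format.pval(anova_p), "\n")
cat("Kruskal-Wallis p-value:", format.pval(kw_res$p.value), "\n")
```

Both tests only indicate that at least one group differs. Identifying which groups differ needs a follow-up such as `TukeyHSD()` after ANOVA or `pairwise.wilcox.test()` after Kruskal-Wallis.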

Hypothesis Statements of Independent Variable Education Level

| S.No | Dependent Variable | Statistical Question | Null Hypothesis | Alternative Hypothesis | Test |
| --- | --- | --- | --- | --- | --- |
| 1 | Total_Gross_Income | Is there a significant relationship between education level and total gross income? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Total_Gross_Income. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Total_Gross_Income. | ANOVA |
| 2 | AMount_Spent_on_Utilities | Is there a significant difference in utility expenses among different education levels? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Utilities. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Utilities. | Kruskal-Wallis rank sum test |
| 3 | Amount_Spent_on_Consumer_Goods_Services | Is there a significant difference in consumer goods and services expenses among different education levels? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services. | Kruskal-Wallis rank sum test |
| 4 | Employee_Income | Does education level have a significant impact on employee income? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Employee_Income. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Employee_Income. | Kruskal-Wallis rank sum test |
| 5 | Self_Employment_income | Is there a significant difference in self-employment income among different education levels? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Self_Employment_income. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Self_Employment_income. | Kruskal-Wallis rank sum test |
| 6 | Financial_assets_Income | Is there a significant relationship between education level and financial asset income? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Financial_assets_Income. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Financial_assets_Income. | Kruskal-Wallis rank sum test |
| 7 | Pension_Income | Is there a significant difference in pension income among different education levels? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Pension_Income. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Pension_Income. | Kruskal-Wallis rank sum test |
| 8 | Credit_Card_Debt | Is there a significant difference in Credit Card Debt among different education levels? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Credit_Card_Debt. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Credit_Card_Debt. | Kruskal-Wallis rank sum test |
| 9 | Value_of_Saving_Accounts | Is there a significant relationship between education level and value of savings account? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Saving_Accounts. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Saving_Accounts. | Kruskal-Wallis rank sum test |
| 10 | Value_of_Self_employment_Businesses | Is there a significant relationship between education level and value of self employment business? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Self_employment_Businesses. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Self_employment_Businesses. | Kruskal-Wallis rank sum test |
| 11 | Amount_spent_on_Food_at_Home | Is there a significant relationship between education level and amount spent on food at home? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_spent_on_Food_at_Home. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_spent_on_Food_at_Home. | Kruskal-Wallis rank sum test |
| 12 | Income_From_Other_Sources | Is there a significant relationship between education level and income earned from other sources? | There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Income_From_Other_Sources. | There is a difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Income_From_Other_Sources. | Kruskal-Wallis rank sum test |

T-tests

The following box plots show how the distributions of the metric variables differ between males and females.

t.test(Total_Gross_Income ~ Gender, data = hcfs)
## 
##  Welch Two Sample t-test
## 
## data:  Total_Gross_Income by Gender
## t = -23.467, df = 5521.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -6866.277 -5807.520
## sample estimates:
## mean in group Female   mean in group Male 
##             20823.63             27160.53

We conducted a Welch two-sample t-test to determine whether Total Gross Income differs between males and females, using a significance level of 0.05. The test gave t = -23.467 with 5521.1 degrees of freedom (df) and a p-value of less than 2.2e-16, far below the significance level. This is strong evidence against the null hypothesis and indicates a statistically significant difference in mean Total Gross Income between males and females. The 95 percent confidence interval for the difference in means (female minus male) ranges from -6866.277 to -5807.520. The sample mean was 20823.63 for females and 27160.53 for males.
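As a quick consistency check, the reported interval can be rebuilt from the other numbers in the output. The sketch below uses only values copied from the t-test output; the variable names are illustrative, and because the inputs are rounded the rebuilt interval only approximately matches the printed one.

```r
# Values copied from the Welch t-test output above.
mean_female <- 20823.63
mean_male   <- 27160.53
t_value     <- -23.467
df          <- 5521.1
ci          <- c(-6866.277, -5807.520)

# Difference in means (female minus male), the quantity the interval estimates.
diff_means <- mean_female - mean_male

# Standard error implied by t = difference / SE.
se <- diff_means / t_value

# Rebuild the 95% interval from the difference, the SE and the t quantile.
half_width <- qt(0.975, df) * se
ci_rebuilt <- diff_means + c(-1, 1) * half_width

print(round(c(diff = diff_means, se = se), 2))
print(round(ci_rebuilt, 1))
```

The difference in means sits at the midpoint of the reported interval, which is what a symmetric t-based interval requires.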

Non-Parametric Tests with Gender

hcfs_subset <- hcfs[, c("AMount_Spent_on_Utilities", "Amount_Spent_on_Consumer_Goods_Services", "Employee_Income", "Self_Employment_income", "Financial_assets_Income", "Value_of_Self_employment_Businesses", "Pension_Income", "Amount_spent_on_Food_at_Home", "Rental_Income", "Credit_Card_Debt", "Value_of_Saving_Accounts", "Income_From_Other_Sources", "Gender")]

cols_of_interest <- c("AMount_Spent_on_Utilities", "Amount_Spent_on_Consumer_Goods_Services", "Employee_Income",
                      "Self_Employment_income", "Financial_assets_Income", "Value_of_Self_employment_Businesses",
                      "Pension_Income", "Amount_spent_on_Food_at_Home", "Rental_Income", "Credit_Card_Debt",
                      "Value_of_Saving_Accounts", "Income_From_Other_Sources")

# Perform Wilcoxon-Mann-Whitney test for each column
for(col in cols_of_interest) {
  test_res <- hcfs_subset %>%
    wilcox.test(formula = as.formula(paste(col, "~ Gender")), data = .)
  
  print(paste0("Column: ", col))
  print(test_res)
}
## [1] "Column: AMount_Spent_on_Utilities"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  AMount_Spent_on_Utilities by Gender
## W = 6497246, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Amount_Spent_on_Consumer_Goods_Services"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Amount_Spent_on_Consumer_Goods_Services by Gender
## W = 5289522, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Employee_Income"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Employee_Income by Gender
## W = 6492337, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Self_Employment_income"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Self_Employment_income by Gender
## W = 6947133, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Financial_assets_Income"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Financial_assets_Income by Gender
## W = 6668344, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Value_of_Self_employment_Businesses"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Value_of_Self_employment_Businesses by Gender
## W = 6962701, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Pension_Income"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Pension_Income by Gender
## W = 7612983, p-value = 0.9046
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Amount_spent_on_Food_at_Home"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Amount_spent_on_Food_at_Home by Gender
## W = 5153925, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Rental_Income"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Rental_Income by Gender
## W = 7509804, p-value = 0.002774
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Credit_Card_Debt"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Credit_Card_Debt by Gender
## W = 7580774, p-value = 0.004729
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Value_of_Saving_Accounts"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Value_of_Saving_Accounts by Gender
## W = 7699595, p-value = 0.3272
## alternative hypothesis: true location shift is not equal to 0
## 
## [1] "Column: Income_From_Other_Sources"
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Income_From_Other_Sources by Gender
## W = 7585036, p-value = 0.201
## alternative hypothesis: true location shift is not equal to 0

Each test is a Wilcoxon rank sum test with continuity correction comparing the male and female groups on one variable. The null hypothesis is that there is no location shift between the two groups, and the alternative hypothesis is that there is one. For nine of the twelve variables the p-value is less than 0.05, so we reject the null hypothesis and conclude that the two groups differ significantly on these variables. For “Pension Income” (p = 0.9046), “Value of Saving Accounts” (p = 0.3272) and “Income From Other Sources” (p = 0.201) the p-value is greater than 0.05, so we fail to reject the null hypothesis: the data provide no evidence of a difference between the two groups for these variables. The test statistic (W) and p-value are reported for each variable.

The test statistic W is computed from the ranks of the pooled observations rather than from the raw values, and the p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

These results suggest that gender is significantly associated with “Amount Spent on Utilities”, “Amount Spent on Consumer Goods and Services”, “Employee Income”, “Self Employment Income”, “Financial Assets Income”, “Value of Self Employment Businesses”, “Amount Spent on Food at Home”, “Rental Income” and “Credit Card Debt”.
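To illustrate how the W statistic is built from ranks, the following minimal sketch recomputes it by hand. It uses simulated data, not the survey; the data frame `toy` and its `Spending` column are hypothetical.

```r
set.seed(1)
# Simulated, skewed "spending" values for two hypothetical groups.
toy <- data.frame(
  Gender   = factor(rep(c("Female", "Male"), times = c(40, 50))),
  Spending = c(rexp(40, rate = 1 / 900), rexp(50, rate = 1 / 1100))
)

res <- wilcox.test(Spending ~ Gender, data = toy)

# W by hand: rank the pooled sample, sum the ranks of the first factor level,
# then subtract the smallest possible rank sum n1 * (n1 + 1) / 2.
r  <- rank(toy$Spending)
n1 <- sum(toy$Gender == "Female")
w_manual <- sum(r[toy$Gender == "Female"]) - n1 * (n1 + 1) / 2

print(c(W_from_test = unname(res$statistic), W_by_hand = w_manual))
```

Because only ranks enter the calculation, a few very large incomes or debts influence W far less than they would influence a difference in means.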

Hypothesis Test Results

| Null Hypothesis | Test | P value | Result |
| --- | --- | --- | --- |
| There is no difference between the Male and Female groups with respect to the dependent variable Total Gross Income. | T test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Amount spent on utilities. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Amount spent on Consumer goods and services. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Employee_Income. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Self_Employment_income. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Financial_assets_Income. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Pension_Income. | Wilcoxon-Mann-Whitney test | 0.9046 | Not rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Rental_Income. | Wilcoxon-Mann-Whitney test | 0.002774 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Credit_Card_Debt. | Wilcoxon-Mann-Whitney test | 0.004729 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Value_of_Saving_Accounts. | Wilcoxon-Mann-Whitney test | 0.3272 | Not rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Value_of_Self_employment_Businesses. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Amount_spent_on_Food_at_Home. | Wilcoxon-Mann-Whitney test | < 2.2e-16 | Rejected |
| There is no difference between the Male and Female groups with respect to the dependent variable Income_From_Other_Sources. | Wilcoxon-Mann-Whitney test | 0.201 | Not rejected |
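A summary table of this kind can be assembled directly from the test objects, which avoids copying p-values by hand. The sketch below is one possible approach on simulated data; the `toy` data frame stands in for `hcfs`, and the shift added to one column is an assumption made only so that at least one test rejects.

```r
set.seed(2)
n <- 200
# Simulated stand-in for hcfs; three column names are reused from the report.
toy <- data.frame(
  Gender          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Employee_Income = rexp(n, rate = 1 / 15000),
  Pension_Income  = rexp(n, rate = 1 / 8000),
  Rental_Income   = rexp(n, rate = 1 / 500)
)
# Shift one variable for one group so that at least one test rejects.
is_male <- toy$Gender == "Male"
toy$Employee_Income[is_male] <- toy$Employee_Income[is_male] + 20000

cols <- setdiff(names(toy), "Gender")

# One Wilcoxon test per column; keep only the p-value.
p_values <- sapply(cols, function(col) {
  wilcox.test(reformulate("Gender", response = col), data = toy)$p.value
})

results <- data.frame(
  variable  = cols,
  test      = "Wilcoxon-Mann-Whitney test",
  # format.pval keeps very small values as "<eps" instead of rounding them.
  p_value   = format.pval(p_values, digits = 4, eps = 2.2e-16),
  result    = ifelse(p_values < 0.05, "Rejected", "Not rejected"),
  row.names = NULL
)
print(results)
```

Using `format.pval` keeps very small p-values in the same "less than" form that the R test output prints.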

ANOVA

Before performing the ANOVA test, the following plots were drawn.

fit <- lm(Total_Gross_Income ~ Age, data = hcfs)

anova(fit)
## Analysis of Variance Table
## 
## Response: Total_Gross_Income
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Age          5 7.0898e+10 1.4180e+10  109.38 < 2.2e-16 ***
## Residuals 8150 1.0566e+12 1.2964e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA table analyzes the response variable Total_Gross_Income by Age group. The null hypothesis is that mean Total_Gross_Income is the same in all age groups, and the alternative hypothesis is that at least one age group has a different mean. The table shows 5 degrees of freedom for Age and 8150 degrees of freedom for Residuals. The sum of squares is 7.0898e+10 for Age and 1.0566e+12 for Residuals, and the corresponding mean squares are 1.4180e+10 and 1.2964e+08. The F-statistic is the ratio of the two mean squares, 109.38, with a p-value of less than 2.2e-16. We therefore reject the null hypothesis and conclude that mean Total_Gross_Income differs for at least one age group.
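The quantities in the table are linked by simple arithmetic, which the following sketch reproduces from the printed values. The eta-squared effect size at the end is an addition that does not appear in the original output, and the inputs are rounded, so the results match the table only approximately.

```r
# Values copied from the ANOVA table for Age above.
df_age   <- 5
df_resid <- 8150
ss_age   <- 7.0898e+10
ss_resid <- 1.0566e+12

ms_age   <- ss_age / df_age       # mean square between age groups
ms_resid <- ss_resid / df_resid   # mean square within age groups
f_value  <- ms_age / ms_resid

# Upper-tail probability of the F distribution with (5, 8150) df.
p_value <- pf(f_value, df_age, df_resid, lower.tail = FALSE)

# Share of total variation attributed to Age (eta squared); this effect
# size is not reported in the original output.
eta_sq <- ss_age / (ss_age + ss_resid)

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```

With more than 8000 observations even modest group differences produce very small p-values, so an effect size such as eta squared helps judge how much of the variation in income the grouping accounts for.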

fit <- lm(Total_Gross_Income ~ Education_Level, data = hcfs)
anova(fit)
## Analysis of Variance Table
## 
## Response: Total_Gross_Income
##                   Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Education_Level    3 2.1248e+11 7.0827e+10  631.04 < 2.2e-16 ***
## Residuals       8152 9.1497e+11 1.1224e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Above is the analysis of variance table for the response variable Total_Gross_Income, with the data grouped by Education_Level. The table shows the results of a one-way ANOVA, which tests whether mean Total_Gross_Income differs between the Education_Level groups.

The table reports two degrees of freedom (Df) values, one for Education_Level and one for Residuals. The Sum Sq column gives the sum of squares for each source of variation, and the Mean Sq column gives the corresponding mean square. The F value and its associated p-value (Pr(>F)) test whether the group means differ, with a smaller p-value indicating stronger evidence against the null hypothesis. In this case the p-value is less than 2.2e-16, which is strong evidence of statistically significant differences in mean Total_Gross_Income between the Education_Level groups. Therefore, we reject the null hypothesis that there is no difference in mean Total_Gross_Income between the groups.
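The one-way ANOVA only indicates that at least one group mean differs. A possible follow-up, not part of the original analysis, is a post-hoc comparison such as Tukey's HSD. The sketch below runs it on simulated data; the level names and group means are hypothetical and not taken from the survey.

```r
set.seed(3)
# Simulated incomes for four hypothetical education levels (not survey data).
levels_edu <- c("Primary", "Secondary", "Upper_Secondary", "Tertiary")
toy <- data.frame(
  Education_Level    = factor(rep(levels_edu, each = 50), levels = levels_edu),
  Total_Gross_Income = rnorm(
    200,
    mean = rep(c(15000, 20000, 26000, 38000), each = 50),
    sd   = 8000
  )
)

# TukeyHSD needs an aov fit rather than an lm fit.
fit_aov <- aov(Total_Gross_Income ~ Education_Level, data = toy)

# One adjusted comparison per pair of levels: difference, interval, p-value.
tukey <- TukeyHSD(fit_aov)
print(round(tukey$Education_Level, 3))
```

With four levels the result has one row per pair, six in total, each with a family-wise adjusted p-value.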

Non-Parametric Tests with Age

hcfs_selected <- hcfs %>% select(Age, AMount_Spent_on_Utilities, Amount_Spent_on_Consumer_Goods_Services,
                                  Employee_Income, Self_Employment_income, Financial_assets_Income,
                                  Value_of_Self_employment_Businesses, Amount_spent_on_Food_at_Home, Credit_Card_Debt,
                                  Value_of_Saving_Accounts, Income_From_Other_Sources)

for (col in 2:ncol(hcfs_selected)) {
  kw_result <- kruskal.test(as.formula(paste(colnames(hcfs_selected)[col], "~", "Age")), data = hcfs_selected)
  print(paste("Column:", colnames(hcfs_selected)[col]))
  print(kw_result)
}
## [1] "Column: AMount_Spent_on_Utilities"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  AMount_Spent_on_Utilities by Age
## Kruskal-Wallis chi-squared = 131.27, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Amount_Spent_on_Consumer_Goods_Services"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Amount_Spent_on_Consumer_Goods_Services by Age
## Kruskal-Wallis chi-squared = 456.43, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Employee_Income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Employee_Income by Age
## Kruskal-Wallis chi-squared = 3296.3, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Self_Employment_income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Self_Employment_income by Age
## Kruskal-Wallis chi-squared = 455.15, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Financial_assets_Income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Financial_assets_Income by Age
## Kruskal-Wallis chi-squared = 239.95, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Value_of_Self_employment_Businesses"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value_of_Self_employment_Businesses by Age
## Kruskal-Wallis chi-squared = 445.62, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Amount_spent_on_Food_at_Home"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Amount_spent_on_Food_at_Home by Age
## Kruskal-Wallis chi-squared = 443.74, df = 5, p-value < 2.2e-16
## 
## [1] "Column: Credit_Card_Debt"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Credit_Card_Debt by Age
## Kruskal-Wallis chi-squared = 47.515, df = 5, p-value = 4.461e-09
## 
## [1] "Column: Value_of_Saving_Accounts"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value_of_Saving_Accounts by Age
## Kruskal-Wallis chi-squared = 58.31, df = 5, p-value = 2.714e-11
## 
## [1] "Column: Income_From_Other_Sources"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Income_From_Other_Sources by Age
## Kruskal-Wallis chi-squared = 152.8, df = 5, p-value < 2.2e-16

The Kruskal-Wallis rank sum test was performed on 10 variables grouped by age category. The test assesses whether the distribution of each variable differs between age groups; when the group distributions have a similar shape, this can be read as a difference in medians. For “AMount_Spent_on_Utilities”, the Kruskal-Wallis chi-squared value was 131.27 with 5 degrees of freedom and a p-value of less than 2.2e-16, indicating strong evidence that the amount spent on utilities differs across age groups. For the remaining nine variables the p-values were also far below 0.05, the largest being 4.461e-09 for Credit_Card_Debt. Based on these results, we reject the null hypothesis of no difference across age groups for every variable and conclude that each variable differs significantly by age group.
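The chi-squared value reported by `kruskal.test` is the H statistic, which depends only on the rank sums of the groups. A minimal sketch on simulated data (the `toy` data frame and its group labels are hypothetical) recomputes it by hand for the case without ties:

```r
set.seed(4)
# Simulated values for three hypothetical age groups; continuous, so no ties.
toy <- data.frame(
  Age   = factor(rep(c("young", "middle", "older"), times = c(30, 40, 35))),
  Value = c(rexp(30, 1 / 500), rexp(40, 1 / 800), rexp(35, 1 / 650))
)

res <- kruskal.test(Value ~ Age, data = toy)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1),
# where R_i is the rank sum and n_i the size of group i.
N   <- nrow(toy)
r   <- rank(toy$Value)
R_i <- tapply(r, toy$Age, sum)
n_i <- tapply(r, toy$Age, length)
h_manual <- 12 / (N * (N + 1)) * sum(R_i^2 / n_i) - 3 * (N + 1)

# Under the null hypothesis H is approximately chi-squared with k - 1 df.
p_manual <- pchisq(h_manual, df = nlevels(toy$Age) - 1, lower.tail = FALSE)

print(c(H_from_test = unname(res$statistic), H_by_hand = h_manual))
```

When ties are present, as is likely for survey variables with many zeros, `kruskal.test` additionally divides H by a tie-correction factor that this sketch omits.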

Non-Parametric Tests with Education level

hcfs_selected <- hcfs %>% select(Education_Level, AMount_Spent_on_Utilities, Amount_Spent_on_Consumer_Goods_Services,
                                  Employee_Income, Self_Employment_income, Financial_assets_Income,
                                  Value_of_Self_employment_Businesses, Pension_Income,
                                  Amount_spent_on_Food_at_Home, Credit_Card_Debt,
                                  Value_of_Saving_Accounts, Income_From_Other_Sources)


for (col in 2:ncol(hcfs_selected)) {
  kw_result <- kruskal.test(as.formula(paste(colnames(hcfs_selected)[col], "~", "Education_Level")), data = hcfs_selected)
  print(paste("Column:", colnames(hcfs_selected)[col]))
  print(kw_result)
}
## [1] "Column: AMount_Spent_on_Utilities"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  AMount_Spent_on_Utilities by Education_Level
## Kruskal-Wallis chi-squared = 453.25, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Amount_Spent_on_Consumer_Goods_Services"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Amount_Spent_on_Consumer_Goods_Services by Education_Level
## Kruskal-Wallis chi-squared = 1184.9, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Employee_Income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Employee_Income by Education_Level
## Kruskal-Wallis chi-squared = 1371.7, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Self_Employment_income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Self_Employment_income by Education_Level
## Kruskal-Wallis chi-squared = 336.57, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Financial_assets_Income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Financial_assets_Income by Education_Level
## Kruskal-Wallis chi-squared = 681.79, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Value_of_Self_employment_Businesses"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value_of_Self_employment_Businesses by Education_Level
## Kruskal-Wallis chi-squared = 327.81, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Pension_Income"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Pension_Income by Education_Level
## Kruskal-Wallis chi-squared = 342.78, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Amount_spent_on_Food_at_Home"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Amount_spent_on_Food_at_Home by Education_Level
## Kruskal-Wallis chi-squared = 622.09, df = 3, p-value < 2.2e-16
## 
## [1] "Column: Credit_Card_Debt"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Credit_Card_Debt by Education_Level
## Kruskal-Wallis chi-squared = 26.302, df = 3, p-value = 8.246e-06
## 
## [1] "Column: Value_of_Saving_Accounts"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value_of_Saving_Accounts by Education_Level
## Kruskal-Wallis chi-squared = 27.467, df = 3, p-value = 4.699e-06
## 
## [1] "Column: Income_From_Other_Sources"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Income_From_Other_Sources by Education_Level
## Kruskal-Wallis chi-squared = 29.733, df = 3, p-value = 1.571e-06

The test results show whether the education levels differ significantly in the amount spent on utilities, amount spent on consumer goods and services, employee income, self-employment income, financial assets income, value of self-employment businesses, pension income, amount spent on food at home, credit card debt, value of savings accounts, and income from other sources. For all eleven variables the p-value is less than 0.05, so we reject the null hypothesis of no difference between the education levels in favor of the alternative hypothesis that the education levels differ.

For example, for the column “AMount_Spent_on_Utilities”, the Kruskal-Wallis chi-squared value is 453.25 with 3 degrees of freedom and a p-value of less than 2.2e-16, which is below the significance level of 0.05. Thus, we reject the null hypothesis.
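The p-values in the output follow from the chi-squared approximation with k - 1 = 3 degrees of freedom. As a small check, the sketch below recomputes them for the three variables whose p-values are printed exactly; the statistics are copied from the output above, and the vector names are illustrative.

```r
# Chi-squared statistics copied from the Kruskal-Wallis output above (df = 3).
kw_stats <- c(
  Credit_Card_Debt          = 26.302,
  Value_of_Saving_Accounts  = 27.467,
  Income_From_Other_Sources = 29.733
)

# Upper-tail chi-squared probability with 3 degrees of freedom.
p_values <- pchisq(kw_stats, df = 3, lower.tail = FALSE)
print(signif(p_values, 4))
```

These values are of the order 1e-06, so any summary of the results should report them in scientific notation rather than as rounded decimals.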

Hypothesis Test Results - Age

| Null Hypothesis | Test | P value | Result |
| --- | --- | --- | --- |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Total_Gross_Income. | ANOVA | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_Spent_on_Utilities. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Employee_Income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Self_Employment_income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Financial_assets_Income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Credit_Card_Debt. | Kruskal-Wallis rank sum test | 4.461e-09 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Value_of_Saving_Accounts. | Kruskal-Wallis rank sum test | 2.714e-11 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Value_of_Self_employment_Businesses. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Amount_spent_on_Food_at_Home. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 6 categories of the independent variable Age with respect to the dependent variable Income_From_Other_Sources. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |

Hypothesis Test Results - Education Level

| Null Hypothesis | Test | P value | Result |
| --- | --- | --- | --- |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Total_Gross_Income. | ANOVA | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Utilities. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_Spent_on_Consumer_Goods_Services. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Employee_Income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Self_Employment_income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Financial_assets_Income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Pension_Income. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Credit_Card_Debt. | Kruskal-Wallis rank sum test | 8.246e-06 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Saving_Accounts. | Kruskal-Wallis rank sum test | 4.699e-06 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Value_of_Self_employment_Businesses. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Amount_spent_on_Food_at_Home. | Kruskal-Wallis rank sum test | < 2.2e-16 | Rejected |
| There is no difference between the 4 categories of the independent variable Education_Level with respect to the dependent variable Income_From_Other_Sources. | Kruskal-Wallis rank sum test | 1.571e-06 | Rejected |

Chi-Square

To determine whether there is a significant association between gender and the categorical variables in our dataset, we performed a chi-square test of independence on a selected subset of the dataset containing 13 Has_* indicator variables.

The following hypothesis statements are considered for each variable:

  • Null hypothesis (H0): There is no association between gender and the listed categorical variable related to assets and financial status.

  • Alternative hypothesis (HA): There is an association between gender and the listed categorical variable related to assets and financial status.

The variables are listed in the first column and are of character data type, holding yes or no responses. The test statistic and p-value are provided for each variable.

##                                variable            test   statistic
## X-squared               Has_Real_Assets Chi-square test  33.6683070
## X-squared1         Has_Financial_Assets Chi-square test  70.5789591
## X-squared2                 Has_Vehicles Chi-square test 997.8911625
## X-squared3                Has_Valuables Chi-square test   2.4927087
## X-squared4       Has_Real_Estate_Wealth Chi-square test  62.3978583
## X-squared5                 Has_Deposits Chi-square test  69.9052432
## X-squared6             Has_Mutual_Funds Chi-square test  23.2308071
## X-squared7                    Has_Bonds Chi-square test  11.8020341
## X-squared8                   Has_Shares Chi-square test  40.6088141
## X-squared9                     Has_Debt Chi-square test  50.9562701
## X-squared10        Has_Credit_Card_Debt Chi-square test   7.2595582
## X-squared11           Has_Private_Loans Chi-square test   0.2819215
## X-squared12 Has_Applied_for_Loan_Credit Chi-square test  29.3152621
##                   p.value
## X-squared    6.535688e-09
## X-squared1   4.422084e-17
## X-squared2  5.160023e-219
## X-squared3   1.143747e-01
## X-squared4   2.806282e-15
## X-squared5   6.222284e-17
## X-squared6   1.436772e-06
## X-squared7   5.916604e-04
## X-squared8   1.859659e-10
## X-squared9   9.444683e-13
## X-squared10  7.052464e-03
## X-squared11  5.954445e-01
## X-squared12  6.150933e-08

From the results, the p-value is below the commonly used 0.05 significance level for 11 of the 13 variables. For these variables we reject the null hypothesis and conclude that there is a significant association with gender. The strongest association is with Has_Vehicles (X-squared = 997.89). For Has_Valuables (p = 0.114) and Has_Private_Loans (p = 0.595) the p-value is greater than 0.05, so we fail to reject the null hypothesis: the data provide no evidence of an association between gender and these two variables, which is not the same as showing that no association exists.

Chi Square Test Results

Gender vs                      P value      Result
Has_Real_Assets                6.54e-09     Rejected
Has_Financial_Assets           4.42e-17     Rejected
Has_Vehicles                   5.16e-219    Rejected
Has_Valuables                  1.14e-01     Not rejected
Has_Real_Estate_Wealth         2.81e-15     Rejected
Has_Deposits                   6.22e-17     Rejected
Has_Mutual_Funds               1.44e-06     Rejected
Has_Bonds                      5.92e-04     Rejected
Has_Shares                     1.86e-10     Rejected
Has_Debt                       9.44e-13     Rejected
Has_Credit_Card_Debt           7.05e-03     Rejected
Has_Private_Loans              5.95e-01     Not rejected
Has_Applied_for_Loan_Credit    6.15e-08     Rejected
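The per-variable tests can be collected into one table with a short loop. The sketch below is only an illustration of that approach, not the code used for this report: the data frame `demo` is simulated, and the names `demo`, `has_vars` and `chi_results` are assumptions introduced for the example.

```r
# Simulated stand-in for the survey data (hypothetical values, not the HFCS data)
set.seed(1)
n <- 500
gender <- sample(c("Female", "Male"), n, replace = TRUE)
demo <- data.frame(
  Gender = gender,
  # Ownership probability depends on gender, so an association exists by construction
  Has_Vehicles = ifelse(runif(n) < ifelse(gender == "Male", 0.8, 0.5), "Yes", "No"),
  # Generated independently of gender
  Has_Valuables = sample(c("No", "Yes"), n, replace = TRUE)
)

has_vars <- c("Has_Vehicles", "Has_Valuables")

# One chi-square test of independence per variable, collected into a data frame
chi_results <- do.call(rbind, lapply(has_vars, function(v) {
  res <- chisq.test(table(demo$Gender, demo[[v]]))
  data.frame(
    variable  = v,
    statistic = unname(res$statistic),
    df        = unname(res$parameter),
    p.value   = res$p.value
  )
}))

# Decision on H0 at the 0.05 level
chi_results$result <- ifelse(chi_results$p.value < 0.05, "Rejected", "Not rejected")
print(chi_results)
```

For 2 x 2 tables `chisq.test()` applies Yates' continuity correction by default, which is why every test has one degree of freedom.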

Regression

Linear Regression

To check how total gross income affects the amount spent on consumer goods and services, we fitted a linear regression model, assuming a linear relationship between the two variables, and obtained the following result.

## 
## Call:
## lm(formula = Amount_Spent_on_Consumer_Goods_Services ~ Total_Gross_Income, 
##     data = hcfs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1600.0  -309.3   -50.3   169.7  8464.4 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.778e+02  1.529e+01   24.71   <2e-16 ***
## Total_Gross_Income 3.432e-02  5.552e-04   61.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 589.5 on 8154 degrees of freedom
## Multiple R-squared:  0.3191, Adjusted R-squared:  0.319 
## F-statistic:  3821 on 1 and 8154 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula = 'y ~ x'

A linear regression analysis was performed to examine the influence of the variable Total_Gross_Income on the variable Amount_Spent_on_Consumer_Goods_Services.

The regression model showed that the variable Total_Gross_Income explained 31.91% of the variance in the variable Amount_Spent_on_Consumer_Goods_Services. An ANOVA (the overall F-test of the model) was used to test whether this value was significantly different from zero. Using the present sample, it was found that the effect was significantly different from zero, F(1, 8154) = 3821.04, p < .001, R2 = 0.32.

The following regression model is obtained,

Amount_Spent_on_Consumer_Goods_Services = 377.79 + 0.034 · Total_Gross_Income

When Total_Gross_Income is zero, the predicted value of the variable Amount_Spent_on_Consumer_Goods_Services is 377.79. If the value of the variable Total_Gross_Income increases by one unit, the predicted value of the variable Amount_Spent_on_Consumer_Goods_Services increases by 0.034.

The standardized coefficient beta does not depend on the measurement units of the variables. In a model with a single predictor it equals the Pearson correlation between the predictor and the outcome, so it lies between -1 and 1. The larger the absolute value of beta, the greater the contribution of the respective independent variable to explaining the dependent variable Amount_Spent_on_Consumer_Goods_Services. Total_Gross_Income is the only predictor in this model, so a comparison of influence across predictors is meaningful only in the multiple regression models.
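Two properties of a one-predictor model can be checked directly against the printed summary. The short calculation below is a sketch that only reuses numbers from the output above; the variable names are assumptions chosen for the example.

```r
# Values copied from the summary() output of the income model
t_slope   <- 61.81    # t value of Total_Gross_Income
f_stat    <- 3821     # F-statistic on 1 and 8154 DF
r_squared <- 0.3191   # Multiple R-squared

# With a single predictor, the overall F-statistic is the square of the slope's t value
t_squared <- t_slope^2
t_squared

# With a single predictor, the standardized slope equals the Pearson correlation;
# its magnitude is the square root of R-squared and its sign follows the slope (positive here)
beta_std <- sqrt(r_squared)
beta_std
```

The squared t value agrees with the reported F-statistic up to rounding, and the standardized slope is roughly 0.56, a moderate positive relationship.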

A linear regression analysis was performed to examine the influence of the variable Value_of_Saving_Accounts on the variable Amount_Spent_on_Consumer_Goods_Services.

## 
## Call:
## lm(formula = Amount_Spent_on_Consumer_Goods_Services ~ Value_of_Saving_Accounts, 
##     data = hcfs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1115   -505   -205    295   8795 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.205e+03  8.767e+00 137.440  < 2e-16 ***
## Value_of_Saving_Accounts 2.604e-02  3.620e-03   7.193 6.91e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 712.2 on 8154 degrees of freedom
## Multiple R-squared:  0.006305,   Adjusted R-squared:  0.006183 
## F-statistic: 51.74 on 1 and 8154 DF,  p-value: 6.909e-13
## `geom_smooth()` using formula = 'y ~ x'

The regression model showed that the variable Value_of_Saving_Accounts explained 0.63% of the variance in the variable Amount_Spent_on_Consumer_Goods_Services. An ANOVA (the overall F-test of the model) was used to test whether this value was significantly different from zero. Using the present sample, it was found that the effect was significantly different from zero, F(1, 8154) = 51.74, p < .001, R2 = 0.006.

The following regression model is obtained,

Amount_Spent_on_Consumer_Goods_Services = 1204.98 + 0.026 · Value_of_Saving_Accounts

When Value_of_Saving_Accounts is zero, the predicted value of the variable Amount_Spent_on_Consumer_Goods_Services is 1204.98. If the value of the variable Value_of_Saving_Accounts increases by one unit, the predicted value of the variable Amount_Spent_on_Consumer_Goods_Services increases by 0.026. Although the slope is statistically significant, Value_of_Saving_Accounts explains less than 1% of the variance, so on its own it is a weak predictor of Amount_Spent_on_Consumer_Goods_Services.

Multiple Linear Regression

A multiple linear regression analysis was performed to examine the influence of the variables Total_Gross_Income and Value_of_Saving_Accounts on the variable Total_Real_Assets.

## [1] 0
## 
## Call:
## lm(formula = Total_Real_Assets ~ Total_Gross_Income + Value_of_Saving_Accounts, 
##     data = hcfs_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -423647  -137611   -53872    47574 13297724 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9570.1977  8647.2241   1.107    0.268    
## Total_Gross_Income          8.4182     0.3125  26.938   <2e-16 ***
## Value_of_Saving_Accounts    2.0699     1.6867   1.227    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 330600 on 8153 degrees of freedom
## Multiple R-squared:  0.08303,    Adjusted R-squared:  0.08281 
## F-statistic: 369.1 on 2 and 8153 DF,  p-value: < 2.2e-16

The regression model showed that the variables Total_Gross_Income and Value_of_Saving_Accounts explained 8.3% of the variance in the variable Total_Real_Assets. An ANOVA (the overall F-test of the model) was used to test whether this value was significantly different from zero. Using the present sample, it was found that the effect was significantly different from zero, F(2, 8153) = 369.14, p < .001, R2 = 0.08.

The following regression model is obtained,

Total_Real_Assets = 9570.2 + 8.42 · Total_Gross_Income + 2.07 · Value_of_Saving_Accounts

When all independent variables are zero, the predicted value of the variable Total_Real_Assets is 9570.2, although the intercept is not significantly different from zero (p = 0.268). Holding Value_of_Saving_Accounts constant, a one-unit increase in Total_Gross_Income increases the predicted value of Total_Real_Assets by 8.42. Holding Total_Gross_Income constant, a one-unit increase in Value_of_Saving_Accounts increases the predicted value of Total_Real_Assets by 2.07, but this coefficient is not statistically significant (p = 0.220). In this model, Total_Gross_Income is the only significant predictor of Total_Real_Assets.
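To make the fitted equation concrete, it can be evaluated for a single household. The sketch below assumes a hypothetical household with a gross income of 30000 and savings of 5000; these two inputs and the helper name `predict_real_assets` are illustrative assumptions, while the coefficients are copied from the output above.

```r
# Coefficients copied from the summary() output
b0 <- 9570.1977   # intercept
b1 <- 8.4182      # Total_Gross_Income
b2 <- 2.0699      # Value_of_Saving_Accounts

# Hypothetical helper that evaluates the fitted equation
predict_real_assets <- function(income, savings) {
  b0 + b1 * income + b2 * savings
}

# Hypothetical household (illustrative inputs, not taken from the data)
predicted <- predict_real_assets(income = 30000, savings = 5000)
predicted
```

The residual standard error of this model is 330600, which is larger than the prediction itself, so a point prediction for an individual household is very imprecise.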

A multiple linear regression analysis was performed to examine the influence of the variables Employee_Income, Self_Employment_income, Rental_Income, Financial_assets_Income and Pension_Income on the variable Value_of_Household_Vehicles.

## [1] 0
## 
## Call:
## lm(formula = Value_of_Household_Vehicles ~ Employee_Income + 
##     Self_Employment_income + Rental_Income + Financial_assets_Income + 
##     Pension_Income, data = hcfs_subset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81561  -3474  -1920   1950 253160 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.976e+03  1.536e+02  12.866  < 2e-16 ***
## Employee_Income         1.610e-01  4.506e-03  35.721  < 2e-16 ***
## Self_Employment_income  1.355e-01  4.287e-03  31.609  < 2e-16 ***
## Rental_Income           3.051e-02  3.047e-02   1.001    0.317    
## Financial_assets_Income 3.358e-01  5.765e-02   5.824 5.95e-09 ***
## Pension_Income          9.425e-02  6.357e-03  14.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8369 on 8150 degrees of freedom
## Multiple R-squared:  0.2292, Adjusted R-squared:  0.2287 
## F-statistic: 484.7 on 5 and 8150 DF,  p-value: < 2.2e-16

The regression model showed that the variables Employee_Income, Self_Employment_income, Rental_Income, Financial_assets_Income and Pension_Income explained 22.92% of the variance in the variable Value_of_Household_Vehicles. An ANOVA (the overall F-test of the model) was used to test whether this value was significantly different from zero. Using the present sample, it was found that the effect was significantly different from zero, F(5, 8150) = 484.68, p < .001, R2 = 0.23.

The following regression model is obtained,

Value_of_Household_Vehicles = 1975.53 + 0.16 · Employee_Income + 0.14 · Self_Employment_income + 0.03 · Rental_Income + 0.34 · Financial_assets_Income + 0.09 · Pension_Income

  • When all independent variables are zero, the value of the variable Value_of_Household_Vehicles is 1975.53.

  • If the value of the variable Employee_Income changes by one unit, the value of the variable Value_of_Household_Vehicles changes by 0.16.

  • If the value of the variable Self_Employment_income changes by one unit, the value of the variable Value_of_Household_Vehicles changes by 0.14.

  • If the value of the variable Rental_Income changes by one unit, the value of the variable Value_of_Household_Vehicles changes by 0.03.

  • If the value of the variable Financial_assets_Income changes by one unit, the value of the variable Value_of_Household_Vehicles changes by 0.34.

  • If the value of the variable Pension_Income changes by one unit, the value of the variable Value_of_Household_Vehicles changes by 0.09.

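In this model, the variable Employee_Income has the largest t value (35.72), while Financial_assets_Income has the largest raw coefficient (0.34). Rental_Income is the only predictor that is not statistically significant (p = 0.317). Raw coefficients depend on the scale and spread of each predictor, so they do not by themselves rank the predictors by their influence on Value_of_Household_Vehicles.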
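One common way to compare the influence of predictors measured on different scales is to refit the model on standardized variables. The sketch below illustrates the idea on simulated data only; the data frame `sim` and its columns are hypothetical and unrelated to the survey values.

```r
set.seed(42)
n <- 1000

# Simulated predictors with very different spreads (hypothetical data)
income_a <- rnorm(n, mean = 20000, sd = 8000)
income_b <- rnorm(n, mean = 500, sd = 100)
vehicles <- 1000 + 0.15 * income_a + 5 * income_b + rnorm(n, sd = 1500)
sim <- data.frame(vehicles, income_a, income_b)

# Model on the original scale: slopes are expressed in the units of each predictor
raw_fit <- lm(vehicles ~ income_a + income_b, data = sim)

# Model on standardized columns (mean 0, standard deviation 1): slopes are beta weights
std_fit <- lm(vehicles ~ income_a + income_b, data = as.data.frame(scale(sim)))

coef(raw_fit)[-1]   # income_b has the larger raw slope
coef(std_fit)[-1]   # income_a has the larger standardized slope
```

In this simulation the ranking of the two predictors reverses after standardization, which is why the raw coefficients of the vehicle model should not be read as a ranking of influence.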

Logistic Regression

To perform Logistic Regression we considered the following questions,

  1. What is the relationship between age, education level, employment status and the likelihood of having credit card debt?
  2. Can likelihood of having mutual funds be predicted based on Gender, Education level and other variables?
## 
## Call:
## glm(formula = Has_Credit_Card_Debt ~ Education_Level + Employment_status, 
##     family = binomial, data = hcfs_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2218  -0.1580  -0.0994  -0.0602   3.6331  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.8523     0.2578 -14.943  < 2e-16 ***
## Education_LevelLower secondary    -0.7243     0.3433  -2.110  0.03487 *  
## Education_LevelPrimary education  -1.8149     0.6636  -2.735  0.00624 ** 
## Education_LevelUpper secondary    -0.5250     0.3099  -1.694  0.09022 .  
## Employment_statusOther            -1.7361     1.0411  -1.668  0.09541 .  
## Employment_statusRetired          -0.9312     0.3572  -2.607  0.00914 ** 
## Employment_statusSelf-employed     0.1594     0.3243   0.492  0.62302    
## Employment_statusUnemployed       -0.8654     1.0193  -0.849  0.39585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747.99  on 8155  degrees of freedom
## Residual deviance: 706.28  on 8148  degrees of freedom
## AIC: 722.28
## 
## Number of Fisher Scoring iterations: 9

The logistic regression model includes Education_Level and Employment_status as predictors of whether a person has credit card debt. The coefficients show the direction and magnitude of the effect of each predictor on the outcome variable.

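The intercept coefficient is -3.8523, which is the log-odds of having credit card debt for a respondent in the reference category of both predictors, that is, the Education_Level and Employment_status categories that do not appear in the coefficient table.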

The Education_Level coefficients show that compared to having a tertiary education level, having a lower secondary or primary education level is associated with a lower log-odds of having credit card debt. The coefficient for upper secondary education level is not statistically significant.

The Employment_status coefficients show that compared to being employed full-time, being retired is associated with a lower log-odds of having credit card debt, while being self-employed or unemployed is not significantly associated with credit card debt. The coefficient for “other” employment status is not statistically significant.

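The residual deviance (706.28 on 8148 degrees of freedom) is lower than the null deviance (747.99 on 8155 degrees of freedom), a drop of 41.71 for 7 additional parameters, so the predictors improve on the intercept-only model. The AIC of 722.28 is not a measure of fit on its own; it is useful only for comparing this model with alternative models fitted to the same data.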
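Log-odds coefficients are easier to read as odds ratios, and the two deviances give a likelihood-ratio test of the model as a whole. The sketch below only reuses numbers from the credit card debt output above; the object names are assumptions chosen for the example.

```r
# Significant coefficients (log-odds) copied from the credit card debt model
log_odds <- c(
  "Lower secondary"   = -0.7243,
  "Primary education" = -1.8149,
  "Retired"           = -0.9312
)

# Exponentiating a coefficient gives the odds ratio against the reference category
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)

# Likelihood-ratio test of all predictors jointly, from the reported deviances
lr_stat <- 747.99 - 706.28   # drop in deviance
lr_df   <- 8155 - 8148       # number of estimated slopes
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = lr_p)
```

An odds ratio below 1 means lower odds of credit card debt than in the reference category; for retired respondents the odds are roughly 0.39 times those of the reference employment category.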

# Subset the data for relevant columns
hcfs_subset <- hcfs[, c("Education_Level", "Employment_status", "Has_Applied_for_Loan_Credit", "Housing_Status")]

hcfs_subset <- hcfs_subset %>%
  mutate(Has_Applied_for_Loan_Credit = recode(Has_Applied_for_Loan_Credit, "Yes" = 1, "No" = 0, .default = 0))

# Fit a logistic regression model
logit_model <- glm(Has_Applied_for_Loan_Credit  ~ Education_Level + Employment_status + Housing_Status, data=hcfs_subset, family=binomial)

# Summarize the model
summary(logit_model)
## 
## Call:
## glm(formula = Has_Applied_for_Loan_Credit ~ Education_Level + 
##     Employment_status + Housing_Status, family = binomial, data = hcfs_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9373  -0.3976  -0.3104  -0.2351   2.7752  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.9943     0.1444 -20.731  < 2e-16 ***
## Education_LevelLower secondary      0.4076     0.1496   2.725 0.006431 ** 
## Education_LevelPrimary education    0.1868     0.1880   0.993 0.320533    
## Education_LevelUpper secondary      0.2550     0.1453   1.755 0.079330 .  
## Employment_statusOther             -0.7568     0.2052  -3.688 0.000226 ***
## Employment_statusRetired           -0.8352     0.1315  -6.350 2.16e-10 ***
## Employment_statusSelf-employed      0.4114     0.1206   3.413 0.000644 ***
## Employment_statusUnemployed        -0.3140     0.2642  -1.189 0.234591    
## Housing_StatusOwner with mortgage   1.5803     0.1208  13.081  < 2e-16 ***
## Housing_StatusRenter                0.5554     0.1049   5.293 1.20e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4137.8  on 8155  degrees of freedom
## Residual deviance: 3791.5  on 8146  degrees of freedom
## AIC: 3811.5
## 
## Number of Fisher Scoring iterations: 6

The logistic regression model assesses the association between the probability of having applied for a loan credit and the independent variables education level, employment status, and housing status. The coefficients and standard errors of the model indicate the direction and strength of the relationship between the dependent variable and independent variables.

The p-values show that at least one category of each predictor (education level, employment status, and housing status) is significantly associated with having applied for loan credit. Each coefficient compares a category with the reference category of its variable, which is the category not listed in the output. Among the education levels, lower secondary education is associated with higher odds of applying (0.41, p = 0.006), while the coefficients for primary and upper secondary education are not significant at the 0.05 level. Among the employment statuses, self-employed respondents have higher odds of applying (0.41), whereas retired respondents (-0.84) and those with other employment status (-0.76) have lower odds; the coefficient for unemployed respondents is not significant. Among the housing statuses, owners with a mortgage (1.58) and renters (0.56) have higher odds of applying than the reference category.

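The residual deviance (3791.5) is 346.3 lower than the null deviance (4137.8) for 9 additional parameters, which indicates that the predictors improve on the intercept-only model. The AIC of 3811.5 is useful only for comparison with other models fitted to the same data. The number of Fisher Scoring iterations (6) is the number of iterations the fitting algorithm needed to converge.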
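The coefficients of the loan application model can also be turned into predicted probabilities for specific profiles. The sketch below copies four coefficients from the output above and applies the inverse logit; the profiles and object names are illustrative assumptions, and the reference profile is whichever combination of categories is omitted from the coefficient table.

```r
# Coefficients (log-odds) copied from the loan application model
intercept      <- -2.9943
owner_mortgage <-  1.5803
self_employed  <-  0.4114

# plogis() is the inverse logit: it maps log-odds to a probability
p_reference     <- plogis(intercept)
p_mortgage      <- plogis(intercept + owner_mortgage)
p_mortgage_self <- plogis(intercept + owner_mortgage + self_employed)

round(c(reference = p_reference,
        mortgage = p_mortgage,
        mortgage_self_employed = p_mortgage_self), 3)
```

The predicted probability rises from about 5% for the reference profile to about 20% for an owner with a mortgage, which shows how strongly housing status is related to loan applications in this model.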

# Subset the data for relevant columns
hcfs_subset <- hcfs[, c("Education_Level", "Employment_status", "Has_Mutual_Funds", "Gender", "Has_Real_Assets")]

hcfs_subset <- hcfs_subset %>%
  mutate(Has_Mutual_Funds = recode(Has_Mutual_Funds, "Yes" = 1, "No" = 0, .default = 0))

# Fit a logistic regression model
logit_model <- glm(Has_Mutual_Funds  ~ Education_Level + Employment_status + Has_Real_Assets + Gender, data=hcfs_subset, family=binomial)

# Summarize the model
summary(logit_model)
## 
## Call:
## glm(formula = Has_Mutual_Funds ~ Education_Level + Employment_status + 
##     Has_Real_Assets + Gender, family = binomial, data = hcfs_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7282  -0.4003  -0.2578  -0.2078   3.1954  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -4.1608     1.0087  -4.125 3.71e-05 ***
## Education_LevelLower secondary    -1.6085     0.1462 -11.000  < 2e-16 ***
## Education_LevelPrimary education  -2.0458     0.1882 -10.872  < 2e-16 ***
## Education_LevelUpper secondary    -0.7048     0.1143  -6.164 7.09e-10 ***
## Employment_statusOther            -0.7731     0.3402  -2.273  0.02303 *  
## Employment_statusRetired           0.3260     0.1131   2.883  0.00393 ** 
## Employment_statusSelf-employed     0.5870     0.1333   4.405 1.06e-05 ***
## Employment_statusUnemployed       -0.6422     0.4625  -1.389  0.16495    
## Has_Real_AssetsYes                 2.0374     1.0039   2.030  0.04240 *  
## GenderMale                         0.3442     0.1112   3.094  0.00197 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3716.8  on 8155  degrees of freedom
## Residual deviance: 3401.3  on 8146  degrees of freedom
## AIC: 3421.3
## 
## Number of Fisher Scoring iterations: 7
library(ggplot2)

# Create a data frame with predictor values
newdata <- expand.grid(
  Education_Level = unique(hcfs_subset$Education_Level),
  Employment_status = unique(hcfs_subset$Employment_status),
  Gender = unique(hcfs_subset$Gender),
  Has_Real_Assets = unique(hcfs_subset$Has_Real_Assets)
)

# Add predicted probabilities to data frame
newdata$prob <- predict(logit_model, newdata, type="response")

# Plot the density of the predicted probabilities for each employment status
ggplot(newdata, aes(x=prob, color=Employment_status)) +
  geom_density() +
  xlab("Predicted Probability of Having Mutual Funds") +
  ylab("Density") +
  ggtitle("Logistic Regression Curve")

The logistic regression model tests the association between the binary response variable “Has_Mutual_Funds” and the predictor variables “Education_Level,” “Employment_status,” “Has_Real_Assets,” and “Gender.” Each coefficient is a difference in log-odds relative to the reference category of its predictor.

The coefficients of the model reveal that, relative to the reference education category, individuals with lower education levels are less likely to have mutual funds, with estimates of -1.61 for “Lower secondary,” -2.05 for “Primary education,” and -0.70 for “Upper secondary” education levels. Retired individuals and those who are self-employed are more likely to have mutual funds, with estimates of 0.33 and 0.59, respectively, while individuals with “Other” employment status are less likely to have mutual funds (-0.77). The estimate for unemployed individuals (-0.64) is also negative but not statistically significant.

Moreover, individuals with real assets are more likely to have mutual funds, with an estimate of 2.04, although this estimate has a large standard error (1.00) and is only marginally significant (p = 0.042). Male individuals are more likely to have mutual funds than female individuals, with an estimate of 0.34. The significance codes reveal that all coefficients are statistically significant at the 0.05 significance level, except for “Employment_statusUnemployed” (p = 0.165).

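The residual deviance (3401.3) is 315.5 lower than the null deviance (3716.8) for 9 additional parameters, so the predictors improve clearly on the intercept-only model, although most of the deviance remains unexplained. The Akaike information criterion (AIC) value of 3421.3 can be used to compare this model with other candidate models fitted to the same data, where a lower AIC is preferred; no such comparison was carried out here.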
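A simple way to put the three logistic models on a common scale is McFadden's pseudo R-squared, which for binary responses is one minus the ratio of residual to null deviance. This measure is not part of the original analysis; the sketch below is an assumption about how it could be computed from the deviances reported in the three summaries.

```r
# Null and residual deviances copied from the three glm() summaries
deviances <- data.frame(
  model    = c("Has_Credit_Card_Debt", "Has_Applied_for_Loan_Credit", "Has_Mutual_Funds"),
  null     = c(747.99, 4137.8, 3716.8),
  residual = c(706.28, 3791.5, 3401.3)
)

# McFadden's pseudo R-squared: share of the null deviance removed by the predictors
deviances$mcfadden_r2 <- round(1 - deviances$residual / deviances$null, 3)
deviances
```

All three values are below 0.1, so in each model the predictors remove only a small share of the null deviance even though many individual coefficients are significant.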

Principal Component Analysis

To reduce the dimensionality of the data by identifying the most important variables, which can be used to represent the data, Principal Component Analysis of the numeric columns in the dataset was performed.

# Select only the numeric variables from hcfs
hcfs_numeric <- hcfs %>% 
  select_if(is.numeric)

#str(hcfs_numeric)

keep1<-subset(hcfs_numeric, select = 3:ncol(hcfs_numeric))

# # scale the data
# hcfs_scaled <- scale(keep1)
# 
# # perform PCA
# hcfs_pca <- prcomp(hcfs_scaled, center = TRUE, scale. = TRUE)
# 
# # view summary of the PCA results
# summary(hcfs_pca)
# 
# # plot the PCA results
# biplot(hcfs_pca)

# Principal Components Analysis creating 15 principal components (i.e. artificial variables)

cat("Doing PCA\n")
## Doing PCA
# change the number “15” in the code below this line if you want to adjust the number of principal components to be created from your data
pc <- principal(keep1, nfactors=min(ncol(keep1),15), rotate="varimax") #rotated
print(summary(pc)) # print the variance accounted for by each principal component
## 
## Factor analysis with Call: principal(r = keep1, nfactors = min(ncol(keep1), 15), rotate = "varimax")
## 
## Test of the hypothesis that 15 factors are sufficient.
## The degrees of freedom for the model is 40  and the objective function was  41.33 
## The number of observations was  8156  with Chi Square =  336272.5  with prob <  0 
## 
## The root mean square of the residuals (RMSA) is  0.04 
## NULL
print(loadings(pc)) # pc loadings for each observed variable
## 
## Loadings:
##                                         RC1    RC2    RC3    RC5    RC14  
## Value_of_Household_Vehicles              0.270                0.165  0.814
## Valuables                                              0.984              
## Deposits                                 0.132  0.483  0.190  0.147 -0.186
## Mutual_Funds                                    0.233                     
## Bonds                                           0.882                0.143
## Employee_Income                          0.522               -0.209  0.211
## Self_Employment_income                   0.121         0.148  0.855  0.103
## Rental_Income                                   0.181                     
## Financial_assets_Income                  0.100  0.905                     
## Pension_Income                           0.298  0.130        -0.172       
## Total_Real_Assets                        0.253  0.231  0.487  0.297  0.129
## Total_Financial_Assets                   0.118  0.851  0.120              
## Total_Gross_Income                       0.737                       0.236
## Value_of_Self_employment_Businesses      0.131                0.861  0.129
## Income_From_Other_Sources                                                 
## Credit_Card_Debt                                                          
## Monthly_Amount_Paid_As_Rent                                               
## Total_Value_of_Cars                      0.315                0.193  0.870
## Value_Of_Other_Vehicles                                              0.227
## Value_Of_Other_Valuables                               0.984              
## No_of_PrivateLoans                                                        
## Value_of_Saving_Accounts                                                  
## Amount_spent_on_Food_at_Home             0.802                            
## Amount_Spent_on_Food_Outside_Home        0.611                0.192       
## AMount_Spent_on_Utilities                0.490         0.168              
## Amount_Spent_on_Consumer_Goods_Services  0.826  0.156         0.134  0.139
##                                         RC6    RC4    RC12   RC13   RC8   
## Value_of_Household_Vehicles                     0.455                     
## Valuables                                                                 
## Deposits                                        0.390  0.218         0.343
## Mutual_Funds                                           0.923              
## Bonds                                                 -0.233              
## Employee_Income                         -0.713                            
## Self_Employment_income                                                    
## Rental_Income                                                 0.957       
## Financial_assets_Income                                0.263  0.158       
## Pension_Income                           0.870                            
## Total_Real_Assets                               0.135         0.363       
## Total_Financial_Assets                          0.118  0.350  0.171       
## Total_Gross_Income                                                        
## Value_of_Self_employment_Businesses                                       
## Income_From_Other_Sources                                                 
## Credit_Card_Debt                                                          
## Monthly_Amount_Paid_As_Rent                                               
## Total_Value_of_Cars                                                       
## Value_Of_Other_Vehicles                         0.920                     
## Value_Of_Other_Valuables                                                  
## No_of_PrivateLoans                                                        
## Value_of_Saving_Accounts                                             0.955
## Amount_spent_on_Food_at_Home                                              
## Amount_Spent_on_Food_Outside_Home       -0.172                            
## AMount_Spent_on_Utilities                                                 
## Amount_Spent_on_Consumer_Goods_Services                                   
##                                         RC9    RC11   RC10   RC7    RC15  
## Value_of_Household_Vehicles                                               
## Valuables                                                                 
## Deposits                                                             0.137
## Mutual_Funds                                                              
## Bonds                                                                     
## Employee_Income                                                           
## Self_Employment_income                                                    
## Rental_Income                                                             
## Financial_assets_Income                                                   
## Pension_Income                                                            
## Total_Real_Assets                       -0.259                       0.129
## Total_Financial_Assets                                                    
## Total_Gross_Income                             -0.141                     
## Value_of_Self_employment_Businesses                                       
## Income_From_Other_Sources                                     0.998       
## Credit_Card_Debt                                       0.996              
## Monthly_Amount_Paid_As_Rent              0.982                            
## Total_Value_of_Cars                                                       
## Value_Of_Other_Vehicles                                                   
## Value_Of_Other_Valuables                                                  
## No_of_PrivateLoans                              0.990                     
## Value_of_Saving_Accounts                                                  
## Amount_spent_on_Food_at_Home                                              
## Amount_Spent_on_Food_Outside_Home                                   -0.533
## AMount_Spent_on_Utilities                                            0.693
## Amount_Spent_on_Consumer_Goods_Services                                   
## 
##                  RC1   RC2   RC3   RC5  RC14   RC6   RC4  RC12  RC13   RC8
## SS loadings    3.182 2.795 2.324 1.814 1.715 1.337 1.260 1.177 1.116 1.068
## Proportion Var 0.122 0.108 0.089 0.070 0.066 0.051 0.048 0.045 0.043 0.041
## Cumulative Var 0.122 0.230 0.319 0.389 0.455 0.506 0.555 0.600 0.643 0.684
##                  RC9  RC11  RC10   RC7  RC15
## SS loadings    1.055 1.010 1.003 1.001 0.830
## Proportion Var 0.041 0.039 0.039 0.039 0.032
## Cumulative Var 0.725 0.764 0.802 0.841 0.873
print(pc$values)
##  [1]  5.519263e+00  2.555186e+00  1.992212e+00  1.659175e+00  1.345562e+00
##  [6]  1.311511e+00  1.063026e+00  1.033489e+00  9.752188e-01  9.606216e-01
## [11]  9.469383e-01  9.119303e-01  8.846765e-01  8.061963e-01  7.230774e-01
## [16]  6.555034e-01  6.462307e-01  5.033627e-01  4.068576e-01  3.666032e-01
## [21]  2.955986e-01  2.363339e-01  1.833365e-01  1.809050e-02  2.010047e-16
## [26] -2.410907e-16
# create scree plot (to help decide how many PCs to keep)
plot(pc$values,type="l",main="", xlab="# factors", ylab="Eigenvalue") # scree plot

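The output reports a test of the hypothesis that 15 components are sufficient: the model has 40 degrees of freedom, the objective function is 41.33, and with 8156 observations the Chi-Square value is 336272.5 with a p-value reported as < 0. A p-value this small formally rejects the hypothesis that 15 components reproduce the correlation matrix exactly, so it does not by itself indicate a good fit. With a sample this large the test is very sensitive to small discrepancies, and it should be read together with the root mean square of the residuals reported in the same output.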

The loadings table gives the correlation between each variable in the dataset (rows) and each retained component (columns). The sign of a loading gives the direction of the relationship and its absolute value gives the strength, with values close to 1 or -1 indicating a strong correlation. Blank cells are loadings with an absolute value below 0.1, which are suppressed when the loadings are printed.

For example, the variable “Value_of_Household_Vehicles” has a loading of 0.27 on the first component (RC1), indicating a weak positive correlation, and a loading of 0.81 on RC14, indicating a strong one. The variable “Valuables” has a loading of 0.98 on RC3, indicating a strong positive correlation. The variable “Credit_Card_Debt” has a loading of 0.996 on RC10 and no loading of 0.1 or more on any other component, so RC10 essentially reproduces this single variable. The variable “Amount_Spent_on_Food_Outside_Home” has a loading of -0.53 on RC15, indicating a moderate negative correlation.

The table also shows the sum of squared loadings (SS loadings) of each rotated component, together with the proportion and cumulative proportion of variance explained. The SS loadings play the role of eigenvalues after rotation: each one is the amount of variance in the data that is explained by that component. For example, the first component (RC1) explains 12.2% of the total variance in the data, while the first two components (RC1 and RC2) together explain 23% of the total variance. The unrotated eigenvalues are printed separately by pc$values; the last two are zero up to rounding error, which indicates that some of the 26 variables are exact linear combinations of others.

The root mean square of the residuals (RMSA) is 0.04, which is low, indicating that the model has a good fit to the data.

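Overall, the 15 rotated components together account for 87.3% of the total variance in the 26 variables. The loadings table shows that several variables load strongly on the first few components, indicating that these components capture important information in the data, whereas several later components (for example RC7, RC10 and RC11) each essentially represent a single variable. The eigenvalues and the scree plot can be used to judge whether fewer components would be adequate.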
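The printed eigenvalues can be used to check how many components a simple rule would retain. The sketch below applies the Kaiser criterion (eigenvalue greater than 1) and computes the cumulative share of variance; the eigenvalues are copied from the print(pc$values) output, with the last two entered as 0, and the object names are assumptions chosen for the example.

```r
# Eigenvalues copied from the print(pc$values) output (the last two are numerically zero)
ev <- c(5.519263, 2.555186, 1.992212, 1.659175, 1.345562,
        1.311511, 1.063026, 1.033489, 0.9752188, 0.9606216,
        0.9469383, 0.9119303, 0.8846765, 0.8061963, 0.7230774,
        0.6555034, 0.6462307, 0.5033627, 0.4068576, 0.3666032,
        0.2955986, 0.2363339, 0.1833365, 0.0180905, 0, 0)

# Kaiser criterion: number of components with an eigenvalue above 1
n_kaiser <- sum(ev > 1)

# Cumulative share of total variance; for a correlation matrix the
# eigenvalues sum to the number of variables (26 here)
cum_var <- cumsum(ev) / sum(ev)

n_kaiser
round(cum_var[c(n_kaiser, 15)], 3)
```

Eight eigenvalues exceed 1, so the Kaiser criterion would retain eight components, compared with the 15 extracted here. The cumulative share for 15 components matches the 0.873 in the loadings table, because rotation redistributes variance among the components without changing their total.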

Decision Trees

To identify the important factors that determine whether a person has private loans, deposits, real estate wealth, vehicles, financial assets, debt, or real assets, we used decision trees.

We converted the categorical variables to factors and restricted the data to a subset of variables, including gender, age, employment status, education level, total gross income, and whether the household owns savings accounts or has credit card debt.
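
The conversion step can also be written compactly by converting every non-numeric-valued column in one call. The sketch below uses a small hypothetical data frame (toy) with a few of the column names from the report; the values are invented for illustration and the choice of which columns stay numeric is an assumption.

```r
# Hypothetical toy data standing in for the survey subset
toy <- data.frame(
  Gender             = c(1, 2, 2, 1),
  Age                = c(34, 58, 45, 71),
  Total_Gross_Income = c(21000, 38500, 27000, 19000),
  Has_Debt           = c(0, 1, 1, 0)
)

# Columns to keep numeric; everything else is treated as categorical
numeric_cols <- c("Age", "Total_Gross_Income")
factor_cols  <- setdiff(names(toy), numeric_cols)

# Convert all categorical columns to factors in a single step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)
```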

We then fit a decision tree model with the rpart() function to predict whether an individual has private loans, deposits, real estate wealth, vehicles, financial assets, debt, or real assets, based on all available predictors. Finally, we plotted each tree with the rpart.plot() function to visualize the predictors that most influence the target variable.
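
As a minimal sketch of this workflow, the example below fits a classification tree to the kyphosis data that ships with the rpart package. The dataset and formula are illustrative assumptions and are unrelated to the survey variables; the point is only to show the fit, the variable importance ranking, and the optional plot.

```r
library(rpart)

# Illustrative only: kyphosis data shipped with rpart, not the survey data
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# Predictors ranked by their contribution to the splits
print(fit$variable.importance)

# Plot only if the optional rpart.plot package is installed
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(fit)
}
```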

library(rpart)
library(rpart.plot)

# Subset the data to the predictors and the ownership indicators
hcfs_subset <- hcfs[, c("Gender", "Age", "Employment_status", "Education_Level",
                        "Total_Gross_Income", "Has_Debt",
                        "Household_Owns_Saving_accounts", "Has_Credit_Card_Debt",
                        "Has_Mutual_Funds", "Has_Shares", "Has_Bonds",
                        "Has_Real_Assets", "Has_Financial_Assets",
                        "Has_Vehicles", "Has_Valuables",
                        "Has_Real_Estate_Wealth", "Has_Deposits",
                        "Has_Private_Loans", "Has_Applied_for_Loan_Credit")]

# Convert categorical variables to factors
hcfs_subset$Gender <- as.factor(hcfs_subset$Gender)
hcfs_subset$Employment_status <- as.factor(hcfs_subset$Employment_status)
hcfs_subset$Education_Level <- as.factor(hcfs_subset$Education_Level)
hcfs_subset$Has_Debt <- as.factor(hcfs_subset$Has_Debt)
hcfs_subset$Household_Owns_Saving_accounts <- as.factor(hcfs_subset$Household_Owns_Saving_accounts)
hcfs_subset$Has_Credit_Card_Debt <- as.factor(hcfs_subset$Has_Credit_Card_Debt)
hcfs_subset$Has_Mutual_Funds <- as.factor(hcfs_subset$Has_Mutual_Funds)
hcfs_subset$Has_Shares <- as.factor(hcfs_subset$Has_Shares)
hcfs_subset$Has_Bonds <- as.factor(hcfs_subset$Has_Bonds)
hcfs_subset$Has_Real_Assets <- as.factor(hcfs_subset$Has_Real_Assets)
hcfs_subset$Has_Financial_Assets <- as.factor(hcfs_subset$Has_Financial_Assets)
hcfs_subset$Has_Vehicles <- as.factor(hcfs_subset$Has_Vehicles)
hcfs_subset$Has_Valuables <- as.factor(hcfs_subset$Has_Valuables)
hcfs_subset$Has_Real_Estate_Wealth <- as.factor(hcfs_subset$Has_Real_Estate_Wealth)
hcfs_subset$Has_Deposits <- as.factor(hcfs_subset$Has_Deposits)
hcfs_subset$Has_Private_Loans <- as.factor(hcfs_subset$Has_Private_Loans)
hcfs_subset$Has_Applied_for_Loan_Credit <- as.factor(hcfs_subset$Has_Applied_for_Loan_Credit)